
Springer Series in Statistics

Advisors:
D. Brillinger, S. Fienberg, J. Gani, J. Hartigan,
J. Kiefer, K. Krickeberg
John W. Pratt
Jean D. Gibbons

Concepts of
Nonparametric Theory

With 23 Figures

Springer-Verlag
New York Heidelberg Berlin
John W. Pratt
Graduate School of Business Administration
Harvard University
Boston, Massachusetts 02163
USA

Jean D. Gibbons
Graduate School of Business Administration
University of Alabama
University, Alabama 35486
USA

AMS Classification (1981): 62Gxx

Library of Congress Cataloging in Publication Data


Pratt, John W. (John Winsor), 1931-
Concepts of nonparametric theory.
(Springer series in statistics)
Includes index.
1. Nonparametric statistics. I. Gibbons, Jean
Dickinson, 1938- II. Title. III. Series.
QA278.8.P73 519.5 81-8933

© 1981 by Springer-Verlag New York Inc.


Softcover reprint of the hardcover 1st edition 1981
All rights reserved. No part of this book may be translated or reproduced in any form
without written permission from Springer-Verlag, 175 Fifth Avenue, New York, New
York 10010, USA.

9 8 7 6 5 4 3 2

ISBN-13: 978-1-4612-5933-6 e-ISBN-13: 978-1-4612-5931-2


DOI: 10.1007/978-1-4612-5931-2
To Joy and Jean

To Jack and John


Preface

This book explores both nonparametric and general statistical ideas by
developing nonparametric procedures in simple situations. The major goal
is to give the reader a thorough intuitive understanding of the concepts
underlying nonparametric procedures and a full appreciation of the properties
and operating characteristics of those procedures covered. This book differs
from most statistics books by including considerable philosophical and
methodological discussion. Special attention is given to discussion of the
strengths and weaknesses of various statistical methods and approaches.
Difficulties that often arise in applying statistical theory to real data also
receive substantial attention.
The approach throughout is more conceptual than mathematical. The
"Theorem-Proof" format is avoided; generally, properties are "shown,"
rather than "proved." In most cases the ideas behind the proof of an im-
portant result are discussed intuitively in the text and formal details are left
as an exercise for the reader. We feel that the reader will learn more from
working such things out than from checking step-by-step a complete presen-
tation of all details.
Those who are interested in applications of nonparametric procedures
and not primarily in the mathematical side of things, but who would like
to have a general understanding of the theoretical bases and properties of
these techniques, will find this book useful as both a reference and a text. In
order to follow most of the main ideas and concepts presented, the reader
should have a good knowledge of the basic concepts of probability theory
and statistical inference at the level of introductory books with a pre-
requisite of one or two years of calculus. More advanced topics require more
mathematical and statistical sophistication. The particularly advanced
sections are indicated by an asterisk and may be omitted. The many exercises


at the end of each chapter also vary in level, from a straightforward data
analysis to a complicated proof. They are designed to supplement, com-
plement, and illustrate the materials covered in the text. The extensive
references provide ample sources for further study. The nonparametric
area is still a fertile field for research, and the interested reader will find no
dearth of topics for further study; this book might provide an impetus for
additional research in nonparametric inference.
The instructor who adopts this book for classroom use can proceed in
various directions and at various levels, as appropriate to the level and
interests of the students. If this course is the student's first exposure to non-
parametric methods, we recommend coverage of selected (unstarred)
portions of Chap. 1-7. If the student has already had an elementary survey
course in nonparametric methods, this book can be used for a second course
to provide more advanced material and deeper coverage of the properties of
the procedures already known to the student. In assigning problems, the
instructor should indicate how much rigor is expected in the solution.
Appropriate references could be assigned for reports on selected topics. The
book could be supplemented by outside readings from some of the references
given.
The book does not attempt to provide a complete compendium of all
the nonparametric methods presently available; only the most important
procedures for testing and estimation that are applicable to the one-sample
and two-sample situations are included. However, those procedures covered
are treated in considerable detail.
This book originated from notes which provided the basis for a course in
nonparametric statistics first given in 1959 at Harvard University. Over the
years, many readers have made valuable comments and suggestions. As there
are too many to name individually, we can only acknowledge a large collec-
tive debt to all readers in the past.
The authors are particularly grateful to the Office of Naval Research,
the National Science Foundation, the Guggenheim Foundation, The
Associates of the Harvard Business School, the Kyoto Institute of Economic
Research, Kyoto University, and the Board of Visitors of the Graduate
School of Business at The University of Alabama, for support; to Robert
Schlaifer for computation of some entries in Tables 8.1 and 11.1 of Chapter 8;
to Arthur Schleifer, Jr. for computation of the entries in Table C; and to
the Literary Executor of the late Sir Ronald A. Fisher, F.R.S., to Dr. Frank
Yates, F.R.S., and to Longman Group Ltd., London, for permission to
reprint Table III from their book Statistical Tables for Biological, Agricultural
and Medical Research (6th edition, 1974).
A Note to the Reader

A two digit reference system is used throughout this book (with the exception
of problems). The first digit denotes a section number within a chapter. For
subsections, equations, theorems, figures and tables within each section, a
second digit is added. If a reference is made to a different chapter, the chapter
number is included, but within the same chapter, it is omitted. Numerical
references, except to equations, are preceded by an appropriate designation,
like Section or Table. Equation numbers always appear in parentheses and
are referred to in this way, e.g., (3.4) of Chap. 2 means Eq. (3.4), which is the
fourth numbered equation in Sect. 3 of Chap. 2. Problems are given at the
end of each chapter; they are numbered sequentially.
Justification of a result, even when entirely heuristic, is sometimes labeled
proof and separated from the rest of the text so that the reader who is not
interested can skip that portion. The end of a proof is indicated by a □ when
it seems helpful. References in the text are given by surname of author and
date. The full citations for these and other pertinent references are given in
the Bibliography.
Throughout the book, more difficult material is indicated by an asterisk *
at the beginning and end. These portions may be omitted without detriment
to understanding other parts of the book.

Contents

CHAPTER 1
Concepts of Statistical Inference and the Binomial Distribution

1 Introduction 1
2 Probability Distributions 2
3 Estimators and their Properties 6
3.1 Unbiasedness and Variance 7
3.2 Consistency 8
3.3 Sufficiency 8
3.4 Minimum Variance 12
4 Hypothesis Testing 14
4.1 Tests and their Interpretation 14
4.2 Errors 17
4.3 One-Tailed Binomial Tests 22
4.4 P-values 23
4.5 Two-Tailed Test Procedures and P-values 28
4.6 Other Conclusions in Two-Tailed Tests 32
5 Randomized Test Procedures 34
5.1 Introduction: Motivation and Examples 34
5.2 Randomized Tests: Definitions 37
5.3 Nonrandomized Tests Equivalent to Randomized Tests 38
5.4 Usefulness of Randomized Tests in Theory and Practice 39
5.5 *Randomized P-values 40
6 Confidence Regions 41
6.1 Definition and Construction in the Binomial Case 41
6.2 Definition of Confidence Regions and Relationship to Tests in the
General Case 45
6.3 Interpretation of Confidence Regions 46
6.4 True Confidence Level 48


6.5 Including False Values and the Size of Confidence Regions 49


6.6 *Randomized Confidence Regions 51
7 Properties of One-Tailed Procedures 52
7.1 Uniformly Most Powerful One-Tailed Tests 52
7.2 *Admissibility and Completeness of One-Tailed Tests 53
7.3 Confidence Procedures 54
7.4 *Proofs 55
8 Choice of Two-Tailed Procedures and their Properties 58
8.1 Test Procedures 58
8.2 Confidence Procedures 61
8.3 *Completeness and Admissibility of Two-Conclusion
Two-Tailed Tests 63
8.4 *Completeness and Admissibility of Three-Conclusion
Two-Tailed Tests 64
9 Appendices to Chapter 1 65
A Limits of the Binomial Distribution 66
A.1 Ordinary Poisson Approximation and Limit 66
A.2 Ordinary Normal Approximation and Limit 67
B Convergence in Distribution and Asymptotic Distributions 69
B.1 Convergence of Frequency Functions and Densities 69
B.2 Convergence in Distribution 72
B.3 Two Central Limit Theorems 73
Problems 74

CHAPTER 2
One-Sample and Paired-Sample Inferences Based on the Binomial
Distribution 82

1 Introduction 82
2 Quantile Values 83
3 The One-Sample Sign Test for Quantile Values 85
3.1 Test Procedures 85
3.2 "Optimum" Properties 88
3.3 *Proofs 91
4 Confidence Procedures Based on the Sign Test 92
5 Interpolation between Attainable Levels 96
6 The Sign Test with Zero Differences 97
6.1 Discussion of Procedures 97
6.2 Conditional Properties of Conditional Sign Tests 99
6.3 Unconditional Properties of Conditional Sign Tests 101
6.4 *Proof for One-Sided Alternatives 101
6.5 *Proof for Two-Sided Alternatives 102
7 Paired Observations 104
8 Comparing Proportions using Paired Observations 106
8.1 Test Procedure 108
8.2 Alternative Presentations 109
8.3 Example 110
8.4 Interpretation of the Test Results 112
8.5 Properties of the Test 114

8.6 Other Inferences 114


8.7 Cox Model 116
9 Tolerance Regions 118
9.1 Definition 118
9.2 Practical Uses 119
9.3 Construction of Tolerance Regions: Wilks' Method 121
9.4 Tolerance Regions for Description 124
9.5 Tolerance Regions for Prediction 126
9.6 More General Construction Procedures 127
Problems 130

CHAPTER 3
One-Sample and Paired-Sample Inferences Based on Signed Ranks 145
1 Introduction 145
2 The Symmetry Assumption or Hypothesis 146
3 The Wilcoxon Signed-Rank Test 147
3.1 Test Procedure and Exact Null Distribution Theory 147
3.2 Asymptotic Null Distribution Theory 150
3.3 Large Sample Power 151
3.4 Consistency 153
3.5 Weakening the Assumptions 155
4 Confidence Procedures Based on the Wilcoxon Signed-Rank
Test 157
5 A Modified Wilcoxon Procedure 158
6 Zeros and Ties 160
6.1 Introduction 160
6.2 Obtaining the Signed Ranks 162
6.3 Test Procedures 163
6.4 Warnings and Anomalies: Examples 167
6.5 Comparison of Procedures 170
7 Other Signed-Rank Procedures 171
7.1 Sums of Signed Constants 172
7.2 Signed Ranks and Walsh Averages 173
7.3 Confidence Bounds Corresponding to Signed-Rank Tests 174
7.4 Procedures Involving a Small Number of Walsh Averages 175
8 Invariance and Signed-Rank Procedures 177
8.1 Permutation Invariance 177
8.2 Invariance under Increasing, Odd Transformations 179
9 Locally Most Powerful Signed-Rank Tests 181
Problems 185

CHAPTER 4
One-Sample and Paired-Sample Inferences Based on the Method of
Randomization 203
1 Introduction 203
2 Randomization Procedures Based on the Sample Mean and
Equivalent Criteria 205

2.1 Tests 205


2.2 Weakening the Assumptions 208
2.3 Related Confidence Procedures 209
2.4 Properties of the Exact Randomization Distribution 210
2.5 Modifications and Approximations to the Exact Randomization
Procedure 210
3 The General Class of One-Sample Randomization Tests 216
3.1 Definition 216
3.2 Properties 217
4 Most Powerful Randomization Tests 222
4.1 General Case 222
4.2 One-Sided Normal Alternatives 223
4.3 Two-Sided Normal Alternatives 223
5 Observation-Randomization versus Rank-Randomization Tests 224
Problems 226

CHAPTER 5
Two-Sample Rank Procedures for Location 231
1 Introduction 231
2 The Shift Assumption 232
3 The Median Test, Other Two-Sample Sign Tests, and Related
Confidence Procedures 234
3.1 Reduction of Data to a 2 x 2 Table 234
3.2 Fisher's Exact Test for 2 x 2 Tables 238
3.3 Ties 241
3.4 Corresponding Confidence Procedures 242
3.5 Power 243
3.6 Consistency 245
3.7 "Optimum" Properties 246
3.8 Weakening the Assumptions 247
4 Procedures Based on Sums of Ranks 249
4.1 The Rank Sum Test Procedure 249
4.2 Null Distribution of the Rank Sum Statistics 252
4.3 Corresponding Confidence Procedures 253
4.4 Approximate Power 255
4.5 Consistency 257
4.6 Weakening the Assumptions 257
4.7 Ties 258
4.8 Point and Confidence Interval Estimation of P(X > Y) 263
5 Procedures Based on Sums of Scores 265
6 Two-Sample Rank Tests and the Y - X Differences 269
7 Invariance and Two-Sample Rank Procedures 269
8 Locally Most Powerful Rank Tests 272
8.1 Most Powerful and Locally Most Powerful Rank Tests Against
Given Alternatives 272
8.2 The Class of Locally Most Powerful Rank Tests 277
Problems 279

CHAPTER 6
Two-Sample Inferences Based on the Method of Randomization 296
1 Introduction 296
2 Randomization Procedures Based on the Difference Between Sample
Means and Equivalent Criteria 297
2.1 Tests 297
2.2 Weakening the Assumptions 298
2.3 Related Confidence Procedures 300
2.4 Properties of the Exact Randomization Distribution 301
2.5 Approximations to the Exact Randomization Distribution 302
3 The Class of Two-Sample Randomization Tests 305
3.1 Definition 305
3.2 Properties 307
4 Most Powerful Randomization Tests 310
4.1 General Case 310
4.2 One-Sided Normal Alternatives 311
4.3 Two-Sided Normal Alternatives 312
Problems 314

CHAPTER 7
Kolmogorov-Smirnov Two-Sample Tests 318
1 Introduction 318
2 Empirical Distribution Function 319
3 Two-Sample Kolmogorov-Smirnov Statistics 320
4 Null Distribution Theory 322
4.1 An Algorithm for the Exact Null Distribution 323
4.2 Relation Between One-Tailed and Two-Tailed Procedures 325
4.3 Exact Formulas for Equal Sample Sizes 325
4.4 Asymptotic Null Distributions 328
5 Ties 330
6 Performance 331
7 One-Sample Kolmogorov-Smirnov Statistics 334
Problems 336

CHAPTER 8
Asymptotic Relative Efficiency 345
1 Introduction 345
2 Asymptotic Behavior of Tests: Heuristic Discussion 347
2.1 Asymptotic Power of a Test 347
2.2 Nuisance Parameters 351
2.3 Asymptotic Relative Efficiency of Two Tests 353
3 Asymptotic Behavior of Point Estimators: Heuristic Discussion 355
3.1 Estimators of the Same Quantity 355
3.2 Relation of Estimators and Tests 357
3.3 *Estimators of Different Quantities 360

4 Asymptotic Behavior of Confidence Bounds 362


5 Example 364
6 *Definitions of Asymptotic Relative Efficiency 370
6.1 Introduction 370
6.2 Tests 371
6.3 Estimators 373
6.4 Confidence Bounds 374
6.5 Confidence Intervals 376
6.6 Summary 376
7 Pitman's Formula 377
8 Asymptotic Relative Efficiencies of One-Sample Procedures for
Shift Families 378
8.1 The Shift Model 379
8.2 Asymptotic Relative Efficiencies for Specific Shift Families 379
8.3 Bounds on Asymptotic Relative Efficiencies for Shift Families 388
8.4 *Asymptotically Efficient Signed-Rank Procedures 392
9 Asymptotic Relative Efficiency of Procedures for Matched Pairs 394
9.1 Assumptions 394
9.2 Bounds for Asymptotic Efficiencies for Matched Pairs 397
10 Asymptotic Relative Efficiency of Two-Sample Procedures for
Shift Families 398
11 Asymptotic Efficiency of Kolmogorov-Smirnov Procedures 401
11.1 One-Sided Tests 402
11.2 Two-Sided Tests 409
Problems 412

Tables 425
Table A Cumulative Standard Normal Distribution 426
Table B Cumulative Binomial Distribution 428
Table C Binomial Confidence Limits 431
Table D Cumulative Probabilities for Wilcoxon Signed-Rank Statistic 433
Table E Cumulative Probabilities for Hypergeometric Distribution 435
Table F Cumulative Probabilities for Wilcoxon Rank Sum Statistic 437
Table G Kolmogorov-Smirnov Two-Sample Statistic 443

Bibliography 445

Index 455
CHAPTER 1
Concepts of Statistical Inference
and the Binomial Distribution

1 Introduction
Most readers of this book will already be well acquainted with the binomial
probability distribution, since it arises in a wide variety of statistical problems,
is simple to understand and use, and is extensively tabled. Our study of
nonparametric statistics will begin with a rather thorough discussion of the
basic concepts of statistical inference, developed and explained in the context
of the binomial model. This approach has been chosen for two reasons. First,
some important nonparametric procedures lead to the binomial model, and
the properties of these nonparametric procedures therefore depend on
properties of binomial procedures. Second, the binomial model provides a
familiar and easy context for the illustration of many of the concepts, terms
and notations which are necessary for an understanding of the nonparametric
procedures developed later in this book. Some of these ideas will be familiar to
the reader, but many belong especially to the area of nonparametric statistics
and will require more careful study. The reader may also find that even the
"simple" binomial situation is less simple than it may have seemed on
previous acquaintance.
In this first chapter, after a brief introduction to probability distributions,
we will discuss the basic concepts and principles of point estimation, hypo-
thesis testing and interval estimation. The various inference techniques will be
described, with an emphasis on problems arising in their interpretation. In the
process of illustrating the procedures, we will study many properties of the
binomial probability distribution, including approximations using other
distributions.

2 Probability Distributions
Suppose that the possible outcomes of an experiment are distinguished only
as belonging to one of two possible categories which we call success and failure.
The two categories must be mutually exclusive, but the terms success and
failure are completely arbitrary and are used solely for convenience. (For
example, if the experiment involves administering a drug to a patient, we
might assign the label "success" to the event that the patient dies. This choice
might be convenient, not merely macabre, because tables are sometimes lim-
ited to the situation where the probability of success does not exceed 0.50.) We
denote the probability of success by p and the probability of failure by q, where
q = 1 - p for any p, 0 ≤ p ≤ 1. The set of all possible outcomes of this simple
experiment could be written as {Success, Failure}, but it will be more con-
venient to write {1, 0} where 1 denotes a success and 0 denotes a failure.
When an experiment of this type is repeated, the trials are called Bernoulli
trials if they are independent and the probability p of success is identical on
every trial. Consider a sequence of n Bernoulli trials where n is fixed. Then the
possible outcomes of this compound experiment can be written as n-tuples

(x_1, ..., x_n) where x_i = 0, 1 for i = 1, ..., n.


The ith component of the n-tuple denotes the outcome of the ith trial. The
set of all possible outcomes contains 2n points. The probability of any
particular point (outcome) whose n-tuple has exactly r 1's and n - r 0's is

p^r (1 - p)^{n-r}

for any r, r = 0, 1, ..., n. This probability is the same for every arrangement of exactly r 1's and n - r 0's. There are

\binom{n}{r} = \frac{n!}{r!(n - r)!}


different such arrangements, all mutually exclusive.
Let S denote the number of 1's which occur in the experiment. Then the probability that S is equal to r, written as P(S = r), is the sum of the probabilities for all those points with exactly r 1's. Since each of these points has the same probability, the sum is the number of points multiplied by that probability, or

P(S = r) = \binom{n}{r} p^r (1 - p)^{n-r}.

This holds for r = 0, 1, ..., n. The probability is zero for all other values of r.
This result is useful whenever we want to distinguish the possible outcomes of
the compound experiment only according to the value of S, i.e., the number of
1's irrespective of the order in which they appear.
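A small Python sketch can verify this counting argument numerically by brute force: it enumerates all 2^n outcome n-tuples, adds up the probabilities of those containing exactly r ones, and compares the sum with the closed form. The values n = 5 and p = 0.3 below are arbitrary illustrative choices.

from itertools import product
from math import comb, isclose

n, p = 5, 0.3          # arbitrary illustrative values
for r in range(n + 1):
    # sum p^(number of 1's) * (1-p)^(number of 0's) over all n-tuples with exactly r ones
    by_enumeration = sum(
        p ** sum(x) * (1 - p) ** (n - sum(x))
        for x in product((0, 1), repeat=n)
        if sum(x) == r
    )
    closed_form = comb(n, r) * p ** r * (1 - p) ** (n - r)
    assert isclose(by_enumeration, closed_form)
    print(r, round(closed_form, 4))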

More formally, what we have done is to map the set of n-tuples into a set of
nonnegative integers which represent the number of 1's. The function can be
denoted by S(x_1, ..., x_n), with a range of {0, 1, ..., n}. The function S is then
called a random variable. This means that it is a function whose domain is the
set of all possible outcomes of an experiment, each outcome of which has a
probability, known or unknown.
This illustrates the usual mathematical definition of a random variable as a
"function on a sample space." Intuitively, a random variable is any uncertain
quantity to which one is willing to attach probability statements. The "sample
space" can be refined if necessary so that all such quantities are functions
thereon. A measurement, or the outcome of an experiment, is a random vari-
able, provided the probabilities of its possible values are subject to discussion.
A random variable may be multidimensional-when several one-dimensional
uncertain quantities are considered simultaneously. Thus a large set of
measurements may be considered a random variable, but as a vector or an
n-tuple.
Any function of a random variable-for instance the sum of a set of measure-
ments-is also ipso facto a random variable. Any function of observable
random variables is called a statistic. For instance, in Bernoulli trials, a
random variable describing the outcome of the ith trial could be defined as

X_i = 0 if the ith outcome is failure, and X_i = 1 if the ith outcome is success.

A statistic of interest is the number of successes in n Bernoulli trials, which is the sum of these random variables, Σ_i X_i = S, since this is the number of successes, discussed above.
It is often convenient to speak of "observations" rather than "random
variables." The term "observation" is more suggestive of the real world, and
"random variable" of the mathematical definition. In common usage,
observation may be either the observed value of a random variable, or a
random variable which is going to be observed.
Confusion sometimes arises if a random variable is not distinguished from
the actual value observed. This is the distinction between a function and its
value. For example, if the observation S is the number of successes in 10
independent tosses of a fair coin, we may say that
P(S = 0) = 1/1024.    (2.1)

If we observe S = 0, however, we cannot substitute 0 for S in (2.1), since in common sense P(0 = 0) ≠ 1/1024. Similarly, if we observe S = 5, we cannot substitute 5 for S in (2.1), since P(5 = 0) ≠ 1/1024. This distinction is not
always as trivial as it seems here, as we shall see later. Nevertheless, we will
sometimes use terminology, such as "observation," which does not make the
distinction, if it does not lead to ambiguity.

It is conventional to denote a random variable by a capital letter, such as S above, and an arbitrary value assumed by that random variable as the corresponding letter or some other letter in lower case. Thus if the random variable S denotes the number of successes in n Bernoulli trials, we may write

P(S = s) = \binom{n}{s} p^s (1 - p)^{n-s} for s = 0, 1, ..., n.    (2.2)

If S satisfies (2.2) it is said to follow the binomial distribution, or to be binomially distributed, with parameters n and p. The term parameter here means any characteristic of the population or theoretical distribution. The binomial is then a family of distributions, with a particular member of the family specified according to the values assigned to the parameters n and p.
The letter q is commonly used to denote 1 - p, in (2.2) for instance. While q is
also a parameter, it is known whenever p is known.
The term distribution, or more specifically probability distribution, will be
used in this book for any function which determines all probabilities relating
to a random variable. It could refer to a frequency function (discrete case), a
density function (continuous case), a cumulative distribution function, a
graph of any of these, or something else. The term will be used by itself only
when it does not matter which interpretation is given.
A random variable or its probability distribution is called discrete if it is
confined to a finite or countably infinite number of values whose probabilities
sum to one. Equivalently, the values with positive probability account for all
of the probability (Problem 7). The probability distribution can then be given
by its frequency function (sometimes called a mass function) defined by
f(x) = P(X = x). Thus the binomial distribution is discrete, with frequency
function given by f(x) = \binom{n}{x} p^x (1 - p)^{n-x} for x = 0, 1, ..., n and f(x) = 0
otherwise, as developed in (2.2).
In practice, all observable random variables are discrete, because of
limitations on precision in measurement. However, it is frequently convenient
to use distributions which are not discrete as approximations to the distribu-
tion of, for instance, a very fine measurement, or the sum or average of several
measurements, or an "ideal" measurement.
The simplest kind of nondiscrete random variable is one whose probability
distribution is completely specified by a probability density function. The
values of this density function are not probabilities because the probability
assigned to any particular single value of such a random variable is zero.
Rather, the density function is integrated to find probabilities for sets of values
of the random variable. Hence a real random variable X is said to have a
density f if

P(a ≤ X ≤ b) = \int_a^b f(x) dx for all a, b, a ≤ b.



In particular we must have both¹

f(x) ≥ 0 and \int_{-∞}^{∞} f(x) dx = 1.

The value of P(a ≤ X ≤ b) is then the area under the density function f(x) and above the x axis, between a and b. The area P(X ≤ z) for arbitrary z is
shown in Fig. 2.1 as the hatched region. Generalization to vector random
variables is straightforward.

Figure 2.1 [the standard normal density φ(x); the hatched area represents P(X ≤ z)]

The particular density function graphed in Figure 2.1 is called the standard, or unit, normal density and is given by the formula

φ(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}.    (2.3)
The area under this curve from z to infinity, or P(X ≥ z), is given by Table A for z ≥ 0. Because the density is symmetric about 0, we have P(X ≤ -z) = P(X ≥ z). Thus the probability to the left of a negative number, that is, the area from minus infinity to -z for z ≥ 0, can also be read directly from this table. If X has a normal distribution with mean μ and standard deviation σ, then Z = (X - μ)/σ has the standard normal density φ above.
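Such Table A lookups can be reproduced numerically; the short sketch below assumes scipy.stats.norm is available, and the particular numbers z = 1.96, μ = 10, σ = 2, and b = 13 are arbitrary illustrative choices.

from scipy.stats import norm

z = 1.96
print(round(norm.sf(z), 4))        # P(X >= z) for standard normal X (upper-tail area)
print(round(norm.cdf(-z), 4))      # P(X <= -z); equal to the line above by symmetry

# Standardization: if X is normal with mean mu and standard deviation sigma,
# then P(X <= b) = P(Z <= (b - mu)/sigma) with Z standard normal.
mu, sigma, b = 10.0, 2.0, 13.0
print(round(norm.cdf(b, loc=mu, scale=sigma), 4))
print(round(norm.cdf((b - mu) / sigma), 4))      # same value as the previous line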
The cumulative distribution function, or c.d.f., of any random variable X, is defined as F(x) = P(X ≤ x), so that

F(x) = P(X ≤ x) = \sum_{t≤x} f(t)         if X has frequency function f,
                = \int_{-∞}^{x} f(t) dt   if X has density function f.

Note that F(-∞) = 0 and F(∞) = 1, while F(a) ≤ F(b) for all a ≤ b. It is customary to denote the c.d.f. by the capital of that letter which in lower case denotes the frequency or density function. The c.d.f. of a discrete random
variable jumps upward by an amount equal to the value of the frequency

¹ This book omits measurability and "almost everywhere" qualifications. Anyone who ought to care about them should have no difficulty deciding where they are appropriate.

function at each point where the latter is positive; elsewhere it is horizontal.


A random variable is called continuous if its c.d.f. is continuous. This holds if it
has a density. Although there are also continuous random variables without
densities, no explicit examples will arise in this book.
The c.d.f. F of a binomial random variable S satisfies

F(s) = P(S ≤ s) = \sum_{i=0}^{s} \binom{n}{i} p^i (1 - p)^{n-i}    (2.4)

for s = 0, 1, ..., n, and in general for any real number x, we have

F(x) = 0 if x < 0,    F(x) = F([x]) if 0 ≤ x ≤ n,    F(x) = 1 if x > n,

where [x] is the largest integer not exceeding x, or the integral part of x.
Numerical values of the cumulative distribution in (2.4) are given in Table B,
for 2 ≤ n ≤ 20 and selected values of p, and for 21 ≤ n ≤ 30 when p = 0.50.
Several more extensive tables are available, for instance, National Bureau of
Standards [1949] or Harvard University [1955]. For n > 20 the normal
probabilities in Table A may be used to approximate the binomial distribution
in various ways. One is explained at the end of Table B; the theoretical
relevance of the normal distribution will be given in Sect. 9.
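The comparison can be sketched numerically as follows; the approximation used here is simply the normal c.d.f. with a continuity correction, which may differ in detail from the recipe at the end of Table B, and n = 25, p = 0.4 are arbitrary illustrative values.

from math import comb, sqrt
from scipy.stats import norm

n, p = 25, 0.4                      # illustrative values only

def binom_cdf(s):
    # exact c.d.f. as in (2.4)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(s + 1))

mean, sd = n * p, sqrt(n * p * (1 - p))
for s in (5, 8, 10, 13):
    exact = binom_cdf(s)
    approx = norm.cdf((s + 0.5 - mean) / sd)    # continuity-corrected normal value
    print(s, round(exact, 4), round(approx, 4))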

3 Estimators and Their Properties


Suppose we have a sequence of n Bernoulli trials in which the probability p of
success is unknown. Then the distribution of the number of successes belongs
to the binomial family, but it is not completely specified even though we
assume n is known. What can we say about p on the basis of the number of
successes observed? To start with, we could estimate the unknown probability
of success by the proportion of successes observed in the n trials, S/n. This is a
natural estimate, since one intuitive meaning of "probability" is "long-run
value of the observed proportion." (Some people think no other meaning is
appropriate.)
Sometimes it is useful to distinguish between the function determining an
estimate and the actual value of the estimate in a particular situation. Then
the function is called an estimator, and an estimate is an observed value of the
function. Strictly speaking then, S/n should be called an estimator of p; if S = 3 is observed, then the estimate is 3/n, and in general if S = s is observed, the estimate is s/n. Thus an estimator is a random variable, or statistic, used to estimate a parameter, while an estimate is an observed value of an estimator.
What are the properties of S/n as an estimator of p? (We did not say
"estimate," since we cannot say how good an actual estimate is, but we can
say something about an estimator or a method of estimation.) Several
properties will now be discussed.

3.1 Unbiasedness and Variance

One property of the estimator S/n of the parameter p in the binomial distribution is that its expected value, or mean, is exactly p. Denoting this expectation by E, we write this statement as

E(S/n) = p.    (3.1)

If we want to emphasize the fact that the distribution of S depends on the parameter p, and hence that the expected value depends on p, we may add a subscript to E, and write

E_p(S/n) = p.    (3.2)

In general, an estimator T of a parameter θ is called unbiased for θ if its expectation is exactly the parameter being estimated, that is, E(T) = θ, for every possible distribution under consideration. Thus, for all binomial distributions, S/n is an unbiased estimator of p. The unbiasedness of S/n for p
is of little significance by itself. For example, the observation X_i on any individual trial is also unbiased for p since E(X_i) = p. A property which
differentiates these two unbiased estimators is the spread of their values
about p. One measure of spread is the variance, defined in general as

var(T) = E{[T - E(T)]^2} = E(T^2) - [E(T)]^2.    (3.3)

The variance of X_i is p(1 - p), and that of S/n is

var_p(S/n) = p(1 - p)/n.    (3.4)

Hence, for any n ≥ 2, the variance of S/n is smaller than the variance of the single observation X_i for all values of p except 0 and 1 (where both variances
are 0).
In fact, the same comparison holds between S/n and any other unbiased estimator, so that S/n is the unique minimum variance unbiased estimator of p.
This property will be further discussed and proved in Sect. 3.4.
When reporting the value of an estimator T of a parameter θ, one should report also some measure of its spread or an estimate thereof. For this purpose, it is better to use the square root of the variance, the standard deviation, because it has the same units as θ and T. For many theoretical purposes,
however, like that just mentioned, the variance is slightly more convenient.
Of course, the variance or standard deviation is a measure of the spread of
the estimator around the parameter only if the estimator is unbiased. For an
arbitrary estimator T it is also useful to define the bias as the difference between

the expected value of the estimator and the parameter θ being estimated, or E(T) - θ, and the mean squared error as E[(T - θ)^2]. The latter can be written as

E[(T - θ)^2] = E{[T - E(T)]^2} + [E(T) - θ]^2 = var(T) + [bias(T)]^2.

If the bias contributes a negligible proportion of the mean squared error, then the lack of unbiasedness is of little consequence. Of course the bias, variance, and expectations above depend on the distribution of T, which may or may not be completely determined by θ under any given assumptions.
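These facts about S/n are easy to check by simulation; in the sketch below, n = 20, p = 0.3, and the number of replications are arbitrary choices. The sample mean of the simulated estimates should be near p, their variance near p(1 - p)/n, and the mean squared error near variance plus squared bias.

import random

random.seed(1)
n, p, reps = 20, 0.3, 200_000       # illustrative values only
estimates = []
for _ in range(reps):
    s = sum(random.random() < p for _ in range(n))   # one binomial(n, p) draw
    estimates.append(s / n)

mean_hat = sum(estimates) / reps
var_hat = sum((e - mean_hat) ** 2 for e in estimates) / reps
mse_hat = sum((e - p) ** 2 for e in estimates) / reps
print(round(mean_hat, 4), p)                                 # unbiasedness: near p
print(round(var_hat, 5), round(p * (1 - p) / n, 5))          # variance: near p(1-p)/n
print(round(mse_hat, 5), round(var_hat + (mean_hat - p) ** 2, 5))   # MSE decomposition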

3.2 Consistency

In the binomial estimation problem the variance of S/n approaches zero as n approaches infinity, as is obvious from (3.4) since p is a constant. The combination of this property with the fact that S/n is unbiased implies that S/n is a
consistent estimator of p. Intuitively, an estimator is consistent if the error in
the estimate probably becomes small as the sample size increases. A more
formal general definition of consistency is that for a sequence of estimators T_n depending on n, the probability that the estimator differs from θ, the parameter estimated, by more than any arbitrary number approaches zero as n approaches infinity. In symbols, the definition is that for any ε > 0, we have

lim_{n→∞} P(|T_n - θ| > ε) = 0.    (3.5)

It is easily shown by application of Chebyshev's inequality (Problem 1) that if


the bias and variance of a sequence of estimators both approach zero as n
approaches infinity, then the sequence is consistent. Unless otherwise stated,
these properties are to hold for each member of whatever family of distribu-
tions is under discussion.
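Consistency of S/n can likewise be illustrated by simulation: for fixed ε, the estimated probability that |S/n - p| exceeds ε shrinks as n grows. In the sketch below, p = 0.3 and ε = 0.05 are arbitrary choices, and the Chebyshev bound p(1 - p)/(nε^2) is printed alongside for comparison.

import random

random.seed(2)
p, eps, reps = 0.3, 0.05, 5_000      # illustrative values only
for n in (25, 100, 400, 1600):
    misses = sum(
        abs(sum(random.random() < p for _ in range(n)) / n - p) > eps
        for _ in range(reps)
    )
    chebyshev = p * (1 - p) / (n * eps ** 2)     # bound from Chebyshev's inequality
    print(n, round(misses / reps, 3), round(min(chebyshev, 1.0), 3))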

3.3 Sufficiency

It seems natural to assume that the estimation procedure in the binomial situation should depend only on the number S (or proportion S/n) of successes. But why not also take into account which trials are successes? Specifically, let X_j = 1 or 0 according as the jth outcome is a success or failure. The number of successes can then be written as S = Σ_j X_j. Should we consider as estimates only functions of S alone, such as S/n? What happens if we consider procedures which are functions of X_1, ..., X_n, but not necessarily functions of S? For example, what about the unbiased estimator (X_1 + X_n)/2? The following discussion will answer this and related questions.
If the value of S is given, say S = 3, then it is intuitively obvious and easily
proved (Problem 2) that the three successes are equally likely to have occurred

on any particular three of the n trials. This is true regardless of the value of p.
More generally, given S = s, for any p, the s successes are equally likely to have
occurred on each set of s of the n trials. Consequently, once the number of
successes is known, it appears intuitively that no further information is gained
about p by knowing which trials produced these successes. The meaning and
implications of this intuitive idea can be made more explicit as follows.
Suppose that we (the authors) know the outcomes of the individual trials
X_1, ..., X_n while you (the reader) know only S. It might seem that we have an advantage by access to more complete information about the experiment. Suppose, upon observing S = s, however, that you choose s out of the n trials at random and arbitrarily call those trials successes and the rest failures, getting what might be called simulated trials Y_1, ..., Y_n. Then, whatever the value of p, your simulated trials Y_1, ..., Y_n have the same distribution as the trials X_1, ..., X_n which actually took place and whose outcomes we know. (Proof: Whatever the value of p, the X's and the Y's have the same conditional joint distribution given S; and of course S is common to both sets of trials. Consequently Y_1, ..., Y_n have the same unconditional joint distribution as X_1, ..., X_n for every p.)
It is now evident that any inference about p which we can make knowing X_1, ..., X_n, you can mimic knowing only S. More explicitly, suppose we use a certain procedure depending on X_1, ..., X_n, such as an estimator of p, or an inference statement about p, or a forecasting statement about future observations X_{n+1}, ..., or a decision rule whose payoff depends on p and/or X_{n+1}, .... Suppose you use the same procedure, but applied to the simulated trials Y_1, ..., Y_n in place of X_1, ..., X_n. Then, although we may not get the same result in a particular instance, your procedure will have exactly the same probabilistic behavior as ours regardless of the value of p. The probability of any event defined in terms of X_1, ..., X_n depends only on the value of p. The same event defined in terms of the simulated trials Y_1, ..., Y_n will have the same probability, whatever the value of p, because the Y's have the same distribution as the X's for all p. If, for instance, we estimate p by a function of X_1, ..., X_n and you estimate p by the same function of Y_1, ..., Y_n, then your estimator will have the same bias, variance, and distribution as ours for all p. In short, our procedure and yours have the same operating characteristics, where the term operating characteristic means any specific aspect of the probabilistic behavior of a procedure. Introducing the vector notations X = (X_1, ..., X_n) and Y = (Y_1, ..., Y_n) for convenience, we state the following properties.

(1) X has various possible distributions, one for each value of p.
(2) S is a function of X, namely S = Σ_j X_j.
(3) The conditional distribution of X for a given value of S does not depend on p.
(4) The conditional distribution of Y for a given value of S is exactly the same as that of X.

The property of S that made it possible to define Y (Property 3) is called "sufficiency." In general, a function S of a random variable X is called
sufficient for a family of possible distributions of X, or sufficient for X, if
the conditional distribution of X given S is the same regardless of which
member of the family is actually the distribution of X.
It is possible to define Y so as to duplicate the distribution of X without
knowing which of its possible distributions X has if and only if S is sufficient.
This can be seen by examining the definition of Y. (See the proofs below.)
More explicitly, a function S of a random variable X is sufficient for X if and only if there exist random variables Y(s) with completely specified distributions (which therefore do not depend on the distribution of X), one such random variable for each value s of S, such that Y(S) has the same distribution
as X whichever of its possible distributions X has.
We see in general, therefore, that if S is sufficient for X, then given any
procedure based on X there exists one based on S having the same prob-
abilistic behavior. We simply define Y so as to duplicate the distribution of X
and then apply the procedure to Y instead of X. There is one catch, however: the definition of Y involves additional randomness, because for a given value s of S, Y(s) is a random variable. Thus we must allow "randomized pro-
cedures," which are considered in Sect. 5.
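The simulated-trials device is easy to try out numerically: knowing only S, place S successes at randomly chosen positions and compare the behavior of the result with that of the original trials. In the sketch below, the statistic X_1 + X_n is tabulated only because it is simple to summarize, and n = 6, p = 0.35, and the replication count are arbitrary choices.

import random
from collections import Counter

random.seed(3)
n, p, reps = 6, 0.35, 100_000        # illustrative values only
cx, cy = Counter(), Counter()
for _ in range(reps):
    x = [int(random.random() < p) for _ in range(n)]  # the "real" trials
    s = sum(x)
    y = [0] * n
    for pos in random.sample(range(n), s):            # knowing only s, place s successes at random
        y[pos] = 1
    cx[x[0] + x[-1]] += 1
    cy[y[0] + y[-1]] += 1
print({k: round(v / reps, 3) for k, v in sorted(cx.items())})
print({k: round(v / reps, 3) for k, v in sorted(cy.items())})   # frequencies nearly match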
Is there an easy method by which a statistic can be checked for sufficiency?
The following factorization criterion is an almost immediate consequence of
the definition of sufficiency. Suppose that X has the family of density or
frequency functions f(x; θ), where, for convenience, θ indexes the possible distributions of X. A function S of X is sufficient for X if and only if f(x; θ) can be written as the product of a function of x alone and a function of S and θ, that is, if and only if f(x; θ) can be factored in the form

f(x; θ) = g[S(x); θ]h(x) for all x.    (3.6)

Here S(x) is the value of S when X has value x. It is important to note that this form of the product must hold for all real vectors x. This means in particular that the domains of the functions may not depend on θ, nor may the region where h(x) ≠ 0.
The factors may be, but need not be, the density or frequency function of S, say g_0, and the density or frequency function of the conditional distribution of X given S, say h_0. To see the idea, note that always f = g_0 h_0, and g_0 is a function of s and θ. Thus f(x; θ) = g_0(s; θ)h_0(x | s, θ) where s = S(x). If S is sufficient for X, then by definition h_0 does not depend on θ in any way, so it is a function of x and s and hence a function of x alone, and f = g_0 h_0 is a factorization of the form (3.6). Conversely, if a factorization of the form (3.6) exists, then h_0 does not depend on θ, as can be shown by computing it as in (3.7)-(3.10) below, so the conditional distribution of X given S does not depend on θ.
Thus we have developed the following three equivalent conditions for
sufficiency of S.

(1) f(x; θ) factors into a function of S and θ times a function of x for all real vectors x.
(2) The conditional distribution of X given S does not depend on θ.
(3) There is a (random) function Y of S such that, for all θ, Y(S) has the same distribution as X.

Any of these might be considered a justification for the intuitive explanation that S is sufficient for X if the distribution of X depends on θ only through S, or S contains all the information about θ, so it is fortunate that they agree. We will now prove that (1) implies (2), and (2) implies (3). The converse proofs, and the special case of the binomial, are left for the reader in Problem 3. Some of these proofs have been given in part already.
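As a quick illustration of (3.6), the Bernoulli-trials case (essentially the binomial part of Problem 3) can be sketched as follows, writing x = (x_1, ..., x_n) with each x_i either 0 or 1 and S(x) = x_1 + ... + x_n:

\[
  f(x; p) = \prod_{i=1}^{n} p^{x_i} (1-p)^{1-x_i}
          = p^{S(x)} (1-p)^{n-S(x)}
          = g[S(x); p]\, h(x),
\]

with g(s; p) = p^s (1 - p)^{n-s} and h(x) identically equal to 1, so that S is sufficient for p.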
PROOFS. To avoid technicalities in the general definition of conditional distributions, we will assume that X has a discrete distribution with frequency function f(x; θ).
Suppose that f factors as in (3.6). To show that S is sufficient, we compute the conditional distribution of X given S = s as follows.

P_θ(X = x | S = s) = P_θ(X = x, S = s) / P_θ(S = s)
                   = P_θ(X = x) / P_θ(S = s) if S(x) = s, and 0 otherwise.    (3.7)
Using (3.6), we have

P_θ(X = x) = f(x; θ) = g[S(x); θ]h(x)    (3.8)
P_θ(S = s) = \sum_{S(x')=s} P_θ(X = x') = g(s; θ) \sum_{S(x')=s} h(x'),    (3.9)
where the sums are over those x' for which S(x') = s. Substituting (3.8) and (3.9) into (3.7), we find

P_θ(X = x | S = s) = h(x) / \sum_{S(x')=s} h(x') if S(x) = s, and 0 otherwise.    (3.10)

Since the right-hand side does not depend on θ, the conditional distribution of X given S does not depend on θ, and S is sufficient.
Now suppose that the conditional distribution of X given S does not depend on θ. We will define Y(s) so that it has the same distribution as X for every θ. Let

f_s(x) = P(X = x | S = s).    (3.11)

This is the conditional frequency function of X given S = s, and does not depend on θ, by assumption.

For each s let Y(s) be a random variable with frequency function f_s(x). We must show that Y(S) has the same distribution as X for all θ. This follows from the fact that it has the same conditional distribution given S as X has, for all θ. Explicitly,

P_θ[Y(S) = x] = \sum_s P_θ(S = s) P_θ[Y(S) = x | S = s]
              = \sum_s P_θ(S = s) f_s(x)    (3.12)
              = \sum_s P_θ(S = s) P_θ(X = x | S = s)
              = P_θ(X = x),

so Y(S) has indeed the same distribution as X for all θ.  □

3.4 Minimum Variance

It was stated in Sect. 3.1 that S/n is the unique, minimum variance unbiased
estimator of the parameter p of the binomial distribution. In this section we
will first discuss this important property briefly and then prove the statement.
In general, an unbiased estimator T of a parameter θ is called a minimum
variance unbiased estimator if no other unbiased estimator has smaller
variance for any distribution under discussion, so that T minimizes the
variance, among unbiased estimators, simultaneously for every distribution
of whatever family has been assumed. This sounds like a splendid property for
an estimator to have, and a minimum variance unbiased estimator is ordinarily
a good one to choose. Note, however, that no such estimator need exist.
Furthermore, even if one does, nothing in the definition precludes the
possibility that some other estimator, though biased, has much smaller mean
squared error. Also, a minimum variance unbiased estimate is sometimes
smaller than the smallest possible value of the parameter, or larger than the
largest (Problem 4). When this happens, the estimate seems clearly unreason-
able, and replacing it by the smallest or largest possible value of the parameter
as appropriate obviously reduces estimation error, though it makes the esti-
mator biased.
Thus as a concept of optimality in estimation, minimum variance un-
biasedness is not completely satisfactory. But neither is any other concept.
Mean squared error cannot ordinarily be minimized for more than one
distribution at a time (Problem 5). However, seeking a truly satisfactory
concept is taking the point estimation problem too seriously. Formal versions
of it do not correspond at all closely to any real problem of inference. There
is, after all, no need to give just a single estimate, "optimal" or not. (In
making actual decisions, treating a decision as an estimate or vice versa is
more confusing than clarifying.) A full-fledged inference must somehow re-
flect the uncertainty in the situation. An estimate is just an intuitive first step.

When formalization leads to difficulty, it is best to give it up and turn to other methods permitting richer sorts of inference statements (as we do in
Sects. 4 and 6 below).
Returning to the choice of an estimator in the binomial case, observe that,
since S is a sufficient statistic, we can restrict consideration to estimators which are functions of S. This means that, to mimic a given estimator, when we observe S = s, we may need to choose a distribution depending on s and draw a random variable U_s from this distribution to use as our estimate. To show we can do better without randomization, let T(s) = E(U_s) for each s. Then no matter what p is, the nonrandomized estimator T(S) has the same mean as the randomized estimator U_S, and smaller variance (equal variance if U_S = T(S) with probability one) (Problem 6). Therefore we cannot reduce the variance by allowing randomized estimators, as one would expect intuitively.
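The mean and variance comparison just asserted (Problem 6) is essentially the conditional-variance decomposition; in the notation above, a sketch is

\[
  \operatorname{var}_p(U_S)
    = E_p[\operatorname{var}(U_S \mid S)] + \operatorname{var}_p(E(U_S \mid S))
    = E_p[\operatorname{var}(U_S \mid S)] + \operatorname{var}_p(T(S))
    \ge \operatorname{var}_p(T(S)),
\]

with equality only when var(U_s | S = s) = 0 for every s of positive probability, that is, when U_S = T(S) with probability one; and E_p[T(S)] = E_p[E(U_S | S)] = E_p(U_S), so the means agree for every p.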
The following proof shows that S/n is the only function of S which is an unbiased estimator of p. From this and the previous paragraph, it follows that S/n is the unique, minimum variance unbiased estimator of p.
PROOF. Suppose that two functions of S are both unbiased estimators of p.
Then their difference Δ(S) has expected value p - p = 0 no matter what the
true value of p. Therefore
E_p[Δ(S)] = \sum_{s=0}^{n} Δ(s) P_p(S = s) = \sum_{s=0}^{n} Δ(s) \binom{n}{s} p^s (1 - p)^{n-s} = 0    (3.13)

for all p. Dividing by (1 - p)^n for p ≠ 1 and replacing p(1 - p)^{-1} by y gives the polynomial equation in y

\sum_{s=0}^{n} Δ(s) \binom{n}{s} y^s = 0    (3.14)

for all y ≥ 0. But a polynomial vanishes identically (in fact, at more points
than its degree) if and only if all coefficients vanish. Hence

Δ(s) \binom{n}{s} = 0    (3.15)

for all s. Since \binom{n}{s} > 0, it follows that Δ(s) = 0 for all s. This says that the difference of two functions of S which are both unbiased estimators of the binomial parameter p is always 0; accordingly, there is only one such function.  □

*The main part of this proof showed that a function of S having expected
value 0 for all p must be identically 0, that is, the binomial family of distri-
butions with n fixed is "complete." A family of distributions is called complete

if it admits no nontrivial unbiased estimator of 0. (A trivial estimator of 0 is 0 itself or any other estimator which equals 0 with probability 1 under every distribution of the family.) Once one knows that the family of possible distributions of any random variable X is complete, then any function of X is an essentially unique unbiased estimator of its expected value, for if there were a nontrivially different one, their difference would be a nontrivial unbiased estimator of 0.*

4 Hypothesis Testing

4.1 Tests and Their Interpretation

As we have already seen, in the binomial situation, a statement about the parameter p can be made on the basis of the observed S by simply using S/n as
an estimator of p. Another way to make a statement about p is to perform a
statistical test, usually called a significance test or test of an hypothesis. A
statistical test consists of a null hypothesis and a rejection rule.

Null Hypothesis

The null hypothesis (called "null" to distinguish the hypothesis under test
from alternative possible hypotheses) is a statement about the distribution of
the observations. Here it might be that "the number of successes is binomial
with p ≤ 0.10," for example. As long as the binomial family is clearly understood in the context of the problem, the statement might be given as simply "p ≤ 0.10." It is customary to denote the null hypothesis by H_0.
A distribution of the observations is called a null distribution if it satisfies
the null hypothesis, and an alternative distribution otherwise. Although we
defined a null hypothesis as a statement about the distribution of the observa-
tions, we can also define it as the set of all distributions satisfying that state-
ment, that is, the set of all null distributions. It will be convenient to allow both
usages. An alternative hypothesis may be defined similarly as a set of alternative
distributions. If the null hypothesis completely specifies the null distribution
including all parameters, that is, the set contains only one particular distribu-
tion, then the null hypothesis is called simple. Otherwise, it is called composite.
An alternative hypothesis may also be either simple or composite.

Rejection Rule

The rejection rule is a criterion saying when the null hypothesis should be
rejected; it is "accepted" (see below) otherwise. The rule may depend on the
observations in any way, but must not depend on any unknown parameters.

It determines the rejection region, often called the critical region, given usually
as a range or region of values of a test statistic. A test statistic may be any
function of the observations. For example, here the rule might be "reject if
S ≤ 3," or "reject if |S/n - 0.3| > 0.10," when stated in terms of the test statistics S and S/n. A test is said to be a one-tailed or a two-tailed test based on
a statistic S if it rejects in one or both tails of S, that is, it rejects for S outside
some interval but not for S inside the interval. Each end of the interval may be
closed or open, finite or infinite. (More complicated regions are occasionally
useful, but in this book the term "test statistic" will always imply a one- or
two-tailed test.) The least extreme value of a test statistic in the rejection
region is called its critical value. For instance, if the rejection region is S ≤ 3,
then 3 is the critical value of S. A two-tailed test has a critical value in each tail,
called the lower and upper critical values.
It is sometimes convenient to represent a test based on a random variable
X with observed value x by the critical function φ(x), 0 ≤ φ(x) ≤ 1 for all x. The rejection region corresponds to φ(x) = 1, that is, those values x for which φ(x) = 1, while the "acceptance" region corresponds to φ(x) = 0. When randomization is considered (Sect. 5), the value of φ(x) is the probability that, given the observed x, the test will choose to reject the null hypothesis, while 1 - φ(x) is the probability that it will not. Regions where the test may either reject or "accept" thus correspond to values x such that 0 < φ(x) < 1. In any case, if the distribution of X is F, the probability of rejection by a test with critical function φ(x) is E_F[φ(X)].
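As a numerical illustration of this representation, the rejection probability E_F[φ(X)] of the test "reject if S ≤ 3" with n = 10, and of a hypothetical randomized variant that also rejects with probability 1/2 when S = 4, can be computed directly from the binomial frequency function; the values of p below are arbitrary choices.

from math import comb

n = 10
def binom_pmf(s, p):
    return comb(n, s) * p**s * (1 - p)**(n - s)

def phi_plain(s):                 # nonrandomized test: reject if S <= 3
    return 1.0 if s <= 3 else 0.0

def phi_randomized(s):            # hypothetical variant: also reject with prob. 1/2 when S = 4
    return 1.0 if s <= 3 else (0.5 if s == 4 else 0.0)

for p in (0.3, 0.5, 0.6):
    for phi in (phi_plain, phi_randomized):
        reject_prob = sum(phi(s) * binom_pmf(s, p) for s in range(n + 1))
        print(p, phi.__name__, round(reject_prob, 4))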

Interpretation of Test Conclusions

In an actual application, when the observations are such that the value of the
test statistic lies in the critical region, it is customary to announce that H_0 is
rejected by the test, or that the set of observations is statistically significant
by the test, or simply that the result is significant. In the contrary case, one may
say that the null hypothesis is not rejected by the test, or that the set of obser-
vations is not statistically significant. Of course (Problem 8), a result which is
not statistically significant may still appear significant for practical purposes,
especially if the test is weak, and vice versa, especially if the test is strong
(technically, powerful-see Sect. 4.2).
If a null hypothesis is rejected by a "reasonable" statistical test, then one is
presumably justified in concluding, at least tentatively in the absence of other
evidence, that the null hypothesis is false. (Unfortunately, this statement is
either very vague or merely a definition of "reasonable.") If the null hypo-
thesis is not rejected, this does not ordinarily justify a conclusion that the null
hypothesis is true. We will find it convenient to say that the null hypothesis is
"accepted" whenever it is not rejected, but will use quotation marks to
emphasize that "accepting" the null hypothesis does not justify concluding
it is true in the same sense that rejecting it justifies concluding it is false.

"Accepting" the null hypothesis is not tantamount to rejection of all other


possibilities; in fact, it might be considered tantamount to drawing no con-
clusion whatever. Tests are thus intended to make it rare that strong con-
clusions will be drawn prematurely.
For example, suppose that S is the number of successes in 10 Bernoulli
trials, and the null hypothesis H_0: p ≥ 0.6 is to be tested by the rule "reject if S ≤ 3." If S = 2 is observed, then the test calls for rejection of the null hypothesis, and we would probably feel justified in concluding that the null hypothesis is false. On the other hand, if S = 5 is observed, then the null hypothesis is not rejected. By our definition, H_0 is "accepted." However,
we would not feel justified in concluding that the null hypothesis is true; in
fact, S = 5 seems more evidence that the null hypothesis is false than true.
Would S = 10 justify concluding that the null hypothesis is true? Maybe, but
then we must treat S = 10 differently from S = 5 and use a rule with at least
three possibilities-concluding that the null hypothesis is false, concluding
that it is true, and concluding neither. Such an interpretation will be discussed
in Sect. 4.6. For the present, we are considering rules with only two possibilities,
rejection and "acceptance." Some observations in the "acceptance" region justify no conclusion, and this must therefore be the interpretation of "acceptance."
In the authors' opinion, the foregoing interpretation of hypothesis tests is
appropriate as they are typically used in practice, especially in situations calling
for non parametric methods. Some people favor the following, somewhat
stronger, interpretation of "accept." Suppose an action has to be taken, and
only two actions are available, one of which is better if the null hypothesis is
true and the other otherwise. Then one could set up a test and act as if the null
hypothesis were true or false according as the test "accepts" or rejects the null hypothesis. The interpretation of the test would then be that if one had to treat the null hypothesis as either true or false and a verdict of "no conclusion"
were not allowed, then one would treat the null hypothesis as true if the test
"accepts" it and false if the test rejects it. Sometimes this interpretation is
suggested in terms of conclusions even when no action is required. In practice,
however, tests are seldom chosen appropriately for any such interpretation.
Furthermore, actions are seldom taken as a direct result of tests. When they
are, as in acceptance sampling and quality control procedures, for example,
special considerations (the relative seriousness of the two types of error, if
nothing else) usually have an important bearing on the choice of the rejection
rule.
In any case, according to all usual rationales, the problem should be
formulated so that rejecting the null hypothesis when it is true is a serious
error, at least more serious than the reverse. The test then controls the prob-
ability of such an error. (See Sect. 4.2.) "Preliminary" tests made to check a
model or assumption on which a later primary analysis will be based seldom
satisfy the previous condition. They should be judged on the basis of their
effect on the total analysis and employed, if at all, not in a routine fashion.
There are problems, such as discrimination between two hypotheses or
classification into one of two categories, where the two types of error are
really alike. It is possible but unnatural to adapt the essentially unsymmetric
framework of testing hypotheses to these situations. If some items may be left
unclassified, the three-decision interpretation of two-tailed tests (Sect. 4.6)
may be relevant.
It is not entirely obvious that tests usually are appropriately interpreted in
any of the foregoing ways. This will be discussed later in Sect. 4.4 in connection
with P-values, an alternative method of presenting test results. For further
discussion, both comforting and alarmist, see also, for instance, Cox [1958a],
Kruskal [1968], Neyman [1950, 1957], Pearson [1962], Pratt [1965], and
Savage [1954, especially Sect. 16.3]. See also Kadane [1976], Kempthorne
[1976], Neyman [1976], Pratt [1976], and Roberts [1976].

4.2 Errors

Types of Errors and Their Probabilities

When a statistical test is performed, two kinds of error are possible. We may
reject the null hypothesis when it is true, making a Type I error (or error of the
first kind). On the other hand, we may "accept" (fail to reject) the null hypo-
thesis when it is false, making a Type II error (or error of the second kind). The
types of errors and correct decisions, which cover all four possibilities, are
shown in the diagram below.

                               Conclusion

True Situation       "Accept" H0          Reject H0

H0 true              No error             Type I error
H0 false             Type II error        No error

If we commit a Type I error, we are concluding that the null hypothesis is
false when it is actually true. With a Type II error, we are drawing essentially
no conclusion when in fact the null hypothesis is false. The two types of error
thus differ in kind, but both are undesirable. Statistics is concerned with
situations where, unfortunately, we cannot be certain to avoid both types of
error.
The probability of a Type I error is evaluated using a null distribution and
hence is frequently called a null probability, while the probability of a Type II
error is evaluated from an alternative distribution. The calculation is easily
illustrated for the binomial distribution. Suppose we use the rule "reject if
S ≤ 3," where S is the number of successes in 10 Bernoulli trials. Then the
probability of rejection α(p) depends on p and is given by

$$\alpha(p) = P_p(S \le 3) = \sum_{i=0}^{3} \binom{10}{i} p^i (1 - p)^{10-i} \qquad (4.1)$$

which may be looked up in Table B. The upper curve in Fig. 4.1, labeled S = 3,
shows a graph of this probability for all values of p. If the null hypothesis under
test is H0: p = 0.5, then the probability of a Type I error, rejecting H0 when it
is true, is simply α(0.5) on the curve, which is 0.172. On the other hand, if we
test the null hypothesis H'0: p ≥ 0.6 with this same rejection region, then the
probability of a Type I error is given by that part of the curve α(p) in Fig. 4.1
where p ≥ 0.6. Since the curve never rises above 0.055 for p ≥ 0.6, this prob-
ability is never more than 0.055.
The probability of a Type II error is calculated in a similar manner. Since
the null hypothesis is "accepted" whenever it is not rejected, the probability of
"acceptance" is one minus the probability of rejection. Hence the probability
of "acceptance" in the example of the previous paragraph is given by

$$1 - \alpha(p) = 1 - P_p(S \le 3)$$

[Figure 4.1: P(S ≤ s) for n = 10, plotted as a function of p.]


where α(p) is defined as before. This quantity is the probability of a Type II
error for all values of p not in the null hypothesis. Thus, if the null hypothesis
is H0: p = 0.5, the probabilities of a Type II error can also be read off Fig. 4.1,
as the complements of the ordinates of the curve for all p ≠ 0.5. Similarly, if
the null hypothesis is H'0: p ≥ 0.6, the probabilities of a Type II error are the
complements of the ordinates where p < 0.6. The maximum (strictly, supre-
mum) probability of a Type II error in this case is 0.945, and it decreases to
zero as p decreases.
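These calculations are also easy to carry out directly rather than read from Table B or Fig. 4.1. The following sketch assumes the Python library scipy is available; the function name rejection_prob is ours, introduced only for illustration. It reproduces the numbers quoted above for the rule "reject if S ≤ 3" with n = 10.

```python
from scipy.stats import binom

def rejection_prob(p, n=10, s_c=3):
    """alpha(p) = P_p(S <= s_c): probability that the rule 'reject if S <= s_c' rejects."""
    return binom.cdf(s_c, n, p)

# Type I error probability for the simple null hypothesis H0: p = 0.5
print(rejection_prob(0.5))          # about 0.172

# Maximum Type I error probability for the composite null H0': p >= 0.6
# (checked on a grid; the maximum is attained at p = 0.6 since alpha(p) decreases in p)
print(max(rejection_prob(p) for p in (0.6, 0.7, 0.8, 0.9, 1.0)))   # about 0.055

# Type II error probability (one minus the rejection probability) at the alternative p = 0.3
print(1 - rejection_prob(0.3))      # about 0.350
```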

Level

Ordinarily, for any statistical test, the probability of rejecting the null
hypothesis depends on the distribution of the observations. It can be cal-
culated as long as this distribution is known. If the null hypothesis is true, this
probability may be a fixed number or may still depend on the distribution of
the observations. If the null hypothesis is simple, like H 0: p = 0.5 above, then
the probability of rejecting the null hypothesis when it is true necessarily has
only one value. If the null hypothesis is composite, then the probability of
rejecting the null hypothesis mayor may not be the same for all distributions
allowed by the null hypothesis. For instance, in the previous example, the
probability of rejecting the null hypothesis H'0: p ≥ 0.6 when true depends on
p.
If for a test of a particular null hypothesis, simple or composite, the
probability of a Type I error is less than or equal to some selected value α for
all null distributions, then the test is said to have level α or to be at level α
(0.05 and 0.01 are popular values for α). The level may be described as con-
servative to emphasize that this kind of level is meant rather than nominal or
exact level (defined below). Thus the test above, which rejects the null hypo-
thesis H'0: p ≥ 0.6 when S ≤ 3, has level 0.10. It also has level 0.08 and level
0.06. It does not quite have level 0.05, however, because there is a distribution
satisfying the null hypothesis and giving a probability of rejection greater
than 0.05; for example, p = 0.6 gives probability 0.055 of rejection. If a test
has level α, it is natural to say also that the level of the test is α. The word "the"
here is somewhat misleading, though not seriously so in practice, since, as the
foregoing example illustrates, the level of a test is not unique.
If a test of the null hypothesis H'0: p ≥ 0.6 is desired at level 0.10, the test
rejecting when S ≤ 3 might be selected. The number 0.10 would then be called
the "nominal level" of the test; 0.055 is called the "exact level" because it is
the maximum probability of rejection for p ≥ 0.6. In general, the nominal level
of a test is the level which one set out to achieve, while the exact level is the
actual maximum probability of rejection under the null hypothesis. The
exact level is the smallest conservative level the test actually has. It would
perhaps be simpler mathematically to define only exact levels, but conservative
and nominal levels are needed in practice and must be discussed, and it is
convenient to have names for them.
The term size is sometimes used instead of level, but it will not be used in
this book. Connotatively, "size" seems to place more emphasis on the rejection
region, and less on the null hypothesis, than the word "level." A test is some-
times called valid at level α if it has level α; this terminology is especially useful
when a null hypothesis has been changed or broadened. Sometimes significance
level is used instead of simply level, to distinguish it from confidence level
which will be defined in Sect. 6.

Interpretation of Level

The level α of a test can be interpreted as bounding the probability of drawing
an erroneous conclusion, if failing to reject the null hypothesis is regarded as
drawing no conclusion and hence as not drawing an erroneous conclusion.
This does not mean, however, that a conclusion to reject the null hypothesis
for one particular set of observations has probability at least 1 - α of being
correct. A particular conclusion is not a random variable, but rather is the
value of a random variable. If only an "objective" or "frequentist" concept of
probability is used, the probability that a conclusion to reject is correct once
the observations have been made must be either 0 (if the conclusion is
incorrect) or 1 (if the conclusion is correct), and this probability is unknown
(except in the trivial cases where the conclusion is either impossible or surely
true).
The probability that any particular conclusion is correct can be computed
if the Bayesian philosophy of statistics is adopted and the additional input it
requires is provided. This philosophy favors using probability to measure the
uncertainty about any quantity which cannot be determined exactly from
available evidence, including the true p in the binomial situation. Probability
is then interpreted as subjective (more rarely objective) but rational degree of
belief. A "prior" probability distribution must be assigned to the parameters
before observation. It is then straightforward to compute the "posterior"
probability, given the observations, that any possible conclusion is correct.
The results may be more or less influenced by the "prior" distribution,
depending on how definitive the data are. Strictly, the prior distribution should
represent prior judgment, but in practice may be chosen conventionally as
a reference point for inferences. There are strong arguments in favor of the
Bayesian philosophy which will not be discussed here (e.g., Savage [1954];
Lindley [1971]; Pratt et al. [1964]). It is a disquieting warning about tests of
hypotheses to find that they rarely seem appropriate from a Bayesian point of
view, and that in most situations, the level of a test is not closely related to the
posterior probability of a correct conclusion (Pratt [1964]; see also the dis-
cussion of the interpretation of P-values in Section 4.4).
Power

The probability of rejection when the null hypothesis is false also depends on
the distribution of the observations, and is called the power of the test. Power
is evaluated using an alternative distribution. If the alternative is simple, the
power is a single number; otherwise it is a function.
Consider again, for instance, the test above for n = 10, which rejects H'0:
p ≥ 0.6 when S ≤ 3. Its power against the alternative p < 0.6 is given by the
function (4.1) for values of p < 0.6. The power curve is then represented by
that portion of the curve in Fig. 4.1 for which p < 0.6. Specifically, the
power of the test is 0.172 when p = 0.5; 0.650 when p = 0.3; 0.987 when p =
0.1; etc. Recall that the remaining portion of the curve, where p ≥ 0.6,
represents the probability of a Type I error, and the complements of the
ordinates for that portion where p < 0.6 represent the probability of a Type II
error. Clearly, the power is always one minus the probability of a Type II
error.

Use of Power in Choosing a Test Statistic

The power should be considered, at least informally, in choosing a test
statistic and the significance level. Even after a test statistic has been chosen,
the power can be increased by changing the critical value, but the probability
of a Type I error will also increase. Thus the power and significance level, or
the probabilities of the two types of errors, must be traded off against each
other. In principle, both of these must also be traded off against the size and
kind of experiment if they are subject to choice (Pearson [1962]). All these
tradeoffs are very difficult to make, especially in a "frequentist" framework,
and are consequently more honored in theory than in practice. It can be
argued that varying the significance level according to circumstances impedes
communication and that conventional levels provide an appropriate tradeoff
between "noise" in the knowledge system and suppression of valid, useful
conclusions (Bross [1971]).
Even when a God-given nominal level is to be used, if the test statistic is
discrete, power considerations should enter the decision of whether to use a
conservative test (unless controlling the Type I error is considered para-
mount, in which case why not make it infinitesimal?). For example, if the
nominal level of a test is 0.05 and the two choices of exact levels nearest 0.05
are 0.001 and 0.06, the power of the conservative test will generally be much
smaller than the power of the test at exact level 0.06. The latter test may then
be the more desirable one. The practice of reporting the P-value, as explained
in Sect. 4.4, partly sidesteps this kind of decision.
Power is also the basis for choosing among different types of test or test
statistics. Different tests may have relatively large or small power against
different alternatives. Hence in choosing among them, one might favor a test
which provides large power against those alternatives which are of particular
interest because of their practical importance or ("subjective!") probability
of occurrence.
Power comparisons of different tests should ordinarily be made at the same
exact levels. Otherwise they may be seriously misleading, because the power of
a test can be increased by increasing the probability of a Type I error. Con-
fusion generally attends comparisons of tests which have the same nominal or
conservative levels but different exact levels.
Two tests, say A and B, are called equivalent if test A rejects if and only if
test B rejects. Equivalent tests necessarily have the same exact level and the
same power against all alternatives, but the converse is not true (Problem 11).
Correspondingly, two test statistics are called equivalent if any test based on
either statistic is equivalent to a test based on the other. This holds whenever
the test statistics are strictly monotonically related (Problem 12).

4.3 One-Tailed Binomial Tests

The null hypothesis p ≥ 0.6 and the alternative p < 0.6 are each one-sided, in
an obvious sense. The test which rejects when S ≤ 3 is also called one-sided,
or one-tailed. Explicitly, it is called lower-tailed or left-tailed since the
rejection region is at the lower end of the range of the test statistic S.
More generally, suppose S is the number of successes in n Bernoulli trials
and we wish to test the null hypothesis p = Po or p ≥ Po against the alternative
p < Po. One rule for testing either of these null hypotheses is to reject when
S ≤ sc, where the critical value sc is chosen so that the level of the test is some
preselected number α. Let sc be the largest integer possible, subject to the
restriction that the left-tail probability P(S ≤ sc) is less than or equal to α
when p = Po. This critical value is easily found from Table B. Algebraically
it is the largest integer sc for which

$$P_{p_0}(S \le s_c) = \sum_{i=0}^{s_c} \binom{n}{i} p_0^i (1 - p_0)^{n-i} \qquad (4.2)$$

does not exceed the nominal level. (The subscript on P indicates that the
probability is to be computed for p = Po.) Given sc, a simple comparison of
the observed value of S with sc determines whether the observations are
significant or not.
For the null hypothesis p = Po, the exact level of this test is, of course, the
exact value of the probability in (4.2). Furthermore, it has the same exact level
for the null hypothesis p ≥ Po (with of course the same power against alter-
natives p < Po). Intuitively, this is because testing the null hypothesis p ≥ Po
against alternatives p < Po is the same as testing the "least favorable case,"
p = Po, against alternatives p < Po. For the binomial distribution, this
intuition is correct (see Problem 13).
The power of this test, namely Pp(S ≤ sc) where p is any value less than Po,
is also easily found from Table B, and can be expressed algebraically as

$$P_p(S \le s_c) = \sum_{i=0}^{s_c} \binom{n}{i} p^i (1 - p)^{n-i} \quad \text{for any } p < p_0. \qquad (4.3)$$
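As a computational illustration of (4.2) and (4.3), the sketch below (again assuming scipy; the helper name lower_tail_critical_value is ours) finds the largest sc whose left-tail probability under Po does not exceed the nominal level, then evaluates the exact level and the power at a few alternatives. For n = 10, Po = 0.6 and nominal level 0.10 it gives sc = 3 with exact level 0.055.

```python
from scipy.stats import binom

def lower_tail_critical_value(n, p0, alpha):
    """Largest s_c with P_{p0}(S <= s_c) <= alpha; returns (s_c, exact level).
    s_c = -1 means the rejection region is empty."""
    s_c = -1
    for s in range(n + 1):
        if binom.cdf(s, n, p0) <= alpha:
            s_c = s
        else:
            break
    exact_level = binom.cdf(s_c, n, p0) if s_c >= 0 else 0.0
    return s_c, exact_level

s_c, exact = lower_tail_critical_value(10, 0.6, 0.10)
print(s_c, round(exact, 3))                    # 3 0.055
for p in (0.5, 0.3, 0.1):                      # power (4.3) at alternatives p < Po
    print(p, round(binom.cdf(s_c, 10, p), 3))  # 0.172, 0.650, 0.987
```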

What about a one-sided hypothesis testing situation in which the alternative
lies on the other side of the null hypothesis? The null hypotheses p = Po and
p ≤ Po may be tested against the alternative p > Po at level α by rejecting
when S ≥ sc, where sc is chosen to be the smallest integer possible subject to
the restriction that P(S ≥ sc) does not exceed α when p = Po. Thus sc is the
smallest integer such that

$$P_{p_0}(S \ge s_c) = \sum_{i=s_c}^{n} \binom{n}{i} p_0^i (1 - p_0)^{n-i} \le \alpha \qquad (4.4)$$

and again Table B can be used. For example, when n = 10 and the alternative
is p > 0.6, rejecting the null hypothesis p = 0.6 or p ≤ 0.6 when S ≥ 9 gives
an exact level of 0.046. This is specifically called an upper-tailed or right-tailed
test procedure. In fact, these testing problems and procedures correspond to
the one-sided binomial testing problems and procedures discussed earlier by
the simple exchange of the definitions of "failure" and "success." For instance,
previously we tested the null hypothesis that the probability of success is equal
to 0.6 against the alternative that it is less than 0.6. Using the rule" reject when
the number of successes is 3 or fewer in 10 trials," the exact level was 0.055.
This is precisely the same as testing the null hypothesis that the probability of
failure is equal to 0.4 against the alternative that it is greater than 0.4, by
rejecting if there are 7 or more failures in the 10 trials. Since "success" and
"failure" are completely arbitrary designations anyway, we can rename the
failures "successes." This is therefore a right-tailed test, and properties of the
two types correspond, with the exact level again 0.055.
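The upper-tailed case (4.4), and the correspondence between the two tails obtained by renaming failures as successes, can be checked numerically in the same way. The sketch below assumes scipy; sf denotes the survival function P(S > k), so P(S ≥ s) is sf(s - 1). It reproduces the exact levels 0.046 and 0.055 mentioned above.

```python
from scipy.stats import binom

n = 10
# Upper-tailed test of p = 0.6 (or p <= 0.6) against p > 0.6: reject when S >= 9
print(round(binom.sf(8, n, 0.6), 3))   # P_0.6(S >= 9) = 0.046, the exact level

# Duality: P_0.6(S <= 3) for successes equals P_0.4(at least 7 failures)
print(round(binom.cdf(3, n, 0.6), 3))  # 0.055
print(round(binom.sf(6, n, 0.4), 3))   # 0.055 as well
```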

4.4 P-values

Definition of P-values

We now discuss an important variation in the method of carrying out a test
and reporting the result. The procedure we have described for a conservative
test is to select a critical region in such a way that the probability of a Type I
error is not more than some nominal level chosen in advance. We then report
whether or not the observations are significant at the particular level chosen.
We could instead, however, find and report the smallest level at which the
observations are significant, the level of just significance or the critical level,
also called the P-value. If this value is smaller than the nominal level, the
observations are significant, and otherwise not significant, according to the
procedure just described. However, if no decision is actually required, it is not
necessary to choose a nominal level or form a specific rejection rule; signific-
ance is a secondary question and reporting the P-value is more informative
and avoids arbitrary choices. Ordinarily the P-value can be found as a tail
probability computed under the null hypothesis using the observed value of
the test statistic. Hence the name P-value. These ideas will now be explained
in more detail.
As an example, in the binomial case with 10 trials and Po = 0.6, S = 3 is
significant by all lower-tailed tests with critical value sc ≥ 3 but not by those
with sc ≤ 2. Since from Table B we have

P0.6(S ≤ 2) = 0.0123
P0.6(S ≤ 3) = 0.0548,
we see that S = 3 is significant by tests at all levels α ≥ 0.0548, but not at
levels α < 0.0548. Thus 0.0548 is the smallest level at which S = 3 is significant.
Similarly, S = 2 is just significant at the 0.0123 level, S = 4 at the 0.1662
level, etc.
More generally, by a lower-tailed test for H0: p = Po or H'0: p ≥ Po in a
binomial problem, a value S = s is just significant at the level

$$P_{p_0}(S \le s) = \sum_{i=0}^{s} \binom{n}{i} p_0^i (1 - p_0)^{n-i}. \qquad (4.5)$$

In other words, the borderline level of significance is the tail probability of s or
less under Po. This is the P-value.
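In computational terms, the lower-tailed P-value (4.5) is just the binomial left-tail probability at the observed value. A minimal sketch (assuming scipy; the function name is ours) reproducing the values quoted above:

```python
from scipy.stats import binom

def lower_tail_p_value(s, n, p0):
    """P-value (4.5) for the lower-tailed binomial test: P_{p0}(S <= s)."""
    return binom.cdf(s, n, p0)

for s in (2, 3, 4):
    print(s, round(lower_tail_p_value(s, 10, 0.6), 4))  # 0.0123, 0.0548, 0.1662
```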
To generalize to an arbitrary situation, suppose that the possible outcomes
x can be ordered according to how "extreme" they are, as by a test statistic,
and that critical regions at all possible significance levels are the tails of this
ordering. If a value x is observed, the smallest critical region to which x
belongs is the tail which just includes x. Hence the smallest significance level
at which x is significant equals the maximum probability, under null distribu-
tions, of the tail which just includes x. This maximum probability is called the
P-value (or exact P-value or P-value at x). It is also frequently called the tail
probability including x, or the associated probability at x, since it represents
the probability associated with an outcome equal to or more extreme than
that actually observed. Of course, it is also the critical level, and if a decision
about significance were required, the outcome x would be judged significant
at all levels greater than or equal to the P-value at x, but not significant at
smaller levels.
When the possible outcomes are ordered as above, significance at one level
implies significance at all larger levels. Conversely, suppose one has a test for
each level α, 0 < α < 1, and significance at one level implies significance at all
larger levels. Then the outcomes can be ordered according to the levels at
which they are just significant and the critical regions are the tails of this
ordering. Furthermore, the P-value itself may be viewed as a test statistic on
which the tests are based.
For some well-behaved problems, the P-value as a tail probability and the
critical level as the level of just significance can be sensibly defined and are
equal. This value can therefore be reported and provides more information
than a statement that the observations are or are not significant at a pre-
selected level. It is possible in some problems, even by some kinds of "opti-
mum" tests, for a set of observations to be significant at one level but not at a
larger level, for instance at the 0.01 level but not at the 0.05 level. Then rejection
regions at different levels would not be nested, and P-values and critical levels
would be difficult to interpret, even if they were defined. No such situations
arise in this book; Chernoff [1951] gives an example which illustrates the
pathology.

Interpretation of P-values

The P-value may perhaps be interpreted as a kind of measure of the extent to
which the observations contradict or support the null hypothesis. One must
be cautious about such an interpretation, however. It certainly is true, but
rather trivial, that in a single experiment, the more extreme the P-value the
more extreme the outcome, and hence the more the null hypothesis is con-
tradicted, as long as "extreme" is properly defined. But this would also be
true for any strictly monotonic function of the P-value, and in particular, for
the test statistic itself.
The real question is, can one compare P-values across sample sizes or
experiments? The answer, unfortunately, has to be that such comparisons
usually have little meaning. We mention three justifications for this assertion.
First, the more powerful a test is, the more extreme a P-value is to be expected
if the null hypothesis is false, and hence the less a given P-value contradicts
the null hypothesis. Second, convincing arguments both within and outside
the "objective" or "frequentist" theory of probability show that the extent of
contradiction of the null hypothesis is not determined by the P-value, but
rather by the likelihood function, a very different animal. (See Birnbaum
[1962].) Third, P-values can be very discordant with the direct answers to
questions of interest when such answers are available, as they are in the
Bayesian framework once the relevant prior distributions have been provided
(see Interpretation of Level in Sect. 4.2).
Often a null hypothesis such as p = Po is almost certainly not literally true.
Then the P-value is obviously not a useful measure of the plausibility of this
literal null hypothesis. Usually the question of real interest relates to whether
such a null hypothesis is nearly true. In the Bayesian framework, however, the
"posterior" probability of this, given the observations, may vary widely,
depending on the sample size and the problem, for a fixed P-value and a
fixed "prior" probability before observation. Even in practical problems, if a
null hypothesis analogous to p = Po has prior probability close to 0.5 of being
nearly true, the posterior probability after observation may well be as small
as 6 times or as large as 12 times the P-value for P-values between 0.001 and
0.05, although it is seldom less than 3 times or more than 30 times the P-value
(according to Good [1958]). These figures are rough, but based on consider-
able though unpublished evidence. See also Jeffreys [1961] and Lindley
[1957]. For interesting examples with discussion, see Good [1969], Efron
[1971], and Pratt [1973]. In this framework then, if the value of a test
statistic is just significant at the 0.05 level, there is still a substantial chance
(at least 0.15) that the null hypothesis is nearly true. This suggests that bare
significance at the 0.05 level, a P-value just below 0.05, is at best not a very
strong justification for concluding that the null hypothesis is appreciably false.
Of course, significance substantially beyond the 0.05 level is another matter.
We note that this illustrates again the disadvantage of a mere statement of
significance or nonsignificance.
In the special case of a one-tailed test of a truly one-sided null hypothesis,
such as p ≤ Po (not p = Po) or a multiparameter analogue, the P-value may
often be expected to be close to the posterior probability of the null hypo-
thesis (see, e.g., Pratt [1965]). It can be argued that in all other cases, both
P-values and tests should be interpreted with great caution.

Discreteness and P-values

We now discuss some difficulties connected with discrete distributions. In
binomial problems, and many others, it is customary to use conservative tests,
that is, to choose the critical value so that the exact level is as near the
nominal level as possible subject to the condition that it be no greater. (Most
available tables and computer programs giving critical values are so con-
structed.) With this custom, an outcome is considered significant at the
nominal level α if and only if the exact P-value is less than or equal to α.
An alternative method of selecting critical values in discrete problems is to
choose them so that the exact level is as near the nominal level as possible on
either side (greater or smaller). Some tables and perhaps computer programs
use this criterion. It should be remembered that whenever the exact level is
larger than the nominal level, this procedure has greater power than a
conservative test.
Consider, for instance, the previous binomial example where n = 10 and
the null hypothesis is p ≥ 0.6. If the nominal level is 0.05, the conservative
critical value is Sc = 2. The exact level nearest the nominal level is given by
Sc = 3, since

P O.6 (S :s;; 2) = 0.0123 and PO.6 (S:s;; 3) = 0.0548


and the latter is nearer 0.05. The probability of rejection for these two tests is
graphed in Fig. 4.1 and shows how much greater is the power of the test using
Sc = 3.
When the exact level nearest the nominal level is used, what is the border-
line level of significance? Suppose, for example, that S = 3 is observed. Then
the exact P-value is 0.0548. The probability P0.6(S ≤ 2) = 0.0123 will be
called the next P-value. The average of these two numbers is (0.0548 +
0.0123)/2 = 0.0336, called the mid-P-value; this is the borderline level of
significance since a nominal level greater than 0.0336 is nearer to 0.0548 than
to 0.0123, while one less is nearer to 0.0123.
For a test of the null hypothesis p ≥ Po in a binomial problem, by the
same rule, S = s is significant at nominal levels greater than, and not sig-
nificant at nominal levels smaller than, the mid-P-value

$$\frac{P_{p_0}(S \le s) + P_{p_0}(S < s)}{2}. \qquad (4.6)$$
In general, as long as the possible outcomes x can be ordered according to
how extreme they are, the mid-P-value is defined as the arithmetic average of
the exact P-value and the next P-value, and is the borderline level of signific-
ance² or critical level according to the rule of "exact level nearest the nominal
level." Here the next P-value, also called the tail probability beyond x, is the
maximum probability under null distributions of an outcome more extreme
than x (see Lancaster [1952] for further discussion).
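The exact, next, and mid-P-values of (4.6) are equally easy to compute. A sketch assuming scipy (the function name mid_p_value is ours):

```python
from scipy.stats import binom

def mid_p_value(s, n, p0):
    """Mid-P-value (4.6) for a lower-tailed binomial test: average of the
    exact P-value P(S <= s) and the next P-value P(S < s), both under p0."""
    exact = binom.cdf(s, n, p0)       # tail probability including s
    nxt = binom.cdf(s - 1, n, p0)     # tail probability beyond s
    return (exact + nxt) / 2

print(round(mid_p_value(3, 10, 0.6), 4))  # 0.0336 for the example above
```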

Summary Recommendations

As mentioned earlier, reporting the P-value for the outcome observed gives
more information than a report of simply significant or not significant, and in
effect permits everyone to choose his own significance level.
For test statistics with discrete null distributions, however, there is still the
question of whether to report the exact P-value or the mid-P-value when they
differ appreciably. This seems a matter of taste, as long as it is made clear
which is being done. If there is a custom for the particular type of problem, this
should be followed. For some audiences, it may be desirable to report both the
exact and next P-values (the tail probabilities including and beyond x).
Approximations based on continuous distributions generally approximate
the mid-P-value rather than the exact P-value unless a "correction for
continuity" is made. This sometimes makes the mid-P-value a little more
convenient to compute. Some people, especially if they believe that the P-value
can be given a precise interpretation, may feel that some one number should
be chosen for the purpose and that there are fundamental grounds for choice
between the exact P-value and the mid-P-value (or something else). See
the discussion of randomized P-values in Sect. 5 and, for instance, Lancaster
[1961].

² Whether the outcome would be significant at precisely this nominal level depends on whether
one chooses the larger or smaller when the nominal level is halfway between the two nearest
exact levels.
Even though it is not appropriate to interpret a P-value as more than a
measure of the extent to which the observations contradict or support the null
hypothesis in a single experiment, the method is well justified and advised on
the grounds that it contains information about the experimental results which
is not reflected in a simple statement of significance at a preselected level.
P-values are discussed further in Gibbons and Pratt [1975].

4.5 Two-Tailed Test Procedures and P-values

Tests Against Two-Sided Alternatives

In the binomial problem, how should we test the null hypothesis p = 0.6
against the alternative p ≠ 0.6? This alternative is a combination of the two
alternatives p < 0.6 and p > 0.6, and is two-sided in an obvious sense. A test
might be performed by combining the left-tail and right-tail tests discussed
previously. Thus, for n = 10, one might reject when S ≤ 3 and also when
S ≥ 9. Since the two tails are mutually exclusive, the exact level of this test can
be computed as the sum of the two tail probabilities under the null hypo-
thesis p = 0.6. These two tail probabilities, the exact levels of the two one-
tailed tests, are respectively 0.055 and 0.046. The exact level of this two-tailed
test is thus
P0.6(S ≤ 3 or S ≥ 9) = P0.6(S ≤ 3) + P0.6(S ≥ 9)
= 0.055 + 0.046 = 0.101.
Similarly, given that the nominal levels of the two individual tests were both
0.10, the nominal level of this two-tailed test is 0.20.
In general, it is always true that combining individual tests at levels α1,
α2, ... for the same null hypothesis H0 gives a test at level α1 + α2 + ... for H0.
The exact level of the combined test is the sum of the exact levels of the in-
dividual tests if the null hypothesis is simple (or the same distribution is
"least favorable" for all tests) and the tests are mutually exclusive (that is, no
possible set of observations is rejected by more than one of the tests). Other-
wise the exact level may be less than (but cannot be more than) the sum
(Problem 19).
For a binomial null hypothesis H0: p = Po, the standard two-tailed test at
level α rejects when either one-tailed test at level α/2 rejects. Thus, specifically
it rejects if S ≤ st or if S ≥ su, where st is the largest integer which satisfies

$$P_{p_0}(S \le s_t) \le \alpha/2 \qquad (4.7)$$

and su is the smallest integer which satisfies

$$P_{p_0}(S \ge s_u) \le \alpha/2. \qquad (4.8)$$
While this test has nominal level α, its exact level is the sum of the actual
values of the left-hand sides of Eqs. (4.7) and (4.8). Similarly, its power
function is the sum
$$P_p(S \le s_t) + P_p(S \ge s_u),$$

calculated under any alternative p ≠ Po.
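A sketch of this standard two-tailed construction, assuming scipy (the helper names are ours): find critical values satisfying (4.7) and (4.8) with α/2 in each tail, then sum the two tail probabilities to obtain the exact level or, under an alternative p, the power.

```python
from scipy.stats import binom

def two_tailed_critical_values(n, p0, alpha):
    """Critical values (s_t, s_u), each tail held at conservative level alpha/2."""
    s_t = max(s for s in range(-1, n + 1) if binom.cdf(s, n, p0) <= alpha / 2)
    s_u = min(s for s in range(0, n + 2) if binom.sf(s - 1, n, p0) <= alpha / 2)
    return s_t, s_u

def rejection_probability(p, n, s_t, s_u):
    """Exact level (at p = p0) or power (at p != p0): P_p(S <= s_t) + P_p(S >= s_u)."""
    return binom.cdf(s_t, n, p) + binom.sf(s_u - 1, n, p)

s_t, s_u = two_tailed_critical_values(10, 0.6, 0.20)
print(s_t, s_u)                                            # 3 9 for nominal level 0.20
print(round(rejection_probability(0.6, 10, s_t, s_u), 3))  # exact level 0.101
```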
It is not automatic that a test against a combined alternative should be
constructed by combining the tests which would be chosen for the individual
alternatives, nor even that a test against a two-sided alternative should
be two-tailed in form just because a one-tailed test would have been used
with each of the one-sided alternatives. It can be shown, however, that
two-tailed tests are (in a sense to be made precise later) the only ones that need
be considered against two-sided alternatives in binomial problems and indeed
in most practical problems. (See Sect. 8.3.)
Even if attention is restricted to two-tailed tests, the critical values st and
su can be chosen in ways other than that above. They may be critical values
for any one-tailed levels α1 and α2 such that α1 + α2 ≤ α. In other words, they
need only satisfy

$$P_{p_0}(S \le s_t) + P_{p_0}(S \ge s_u) \le \alpha. \qquad (4.9)$$
Various possibilities, including an "optimality criterion" which is really a
convention, will be discussed later, when the properties of two-tailed tests are
investigated in Sect. 8. It is difficult, however, to give a convincing justification
in the frequentist framework for choosing a particular two-tailed test among
those at level α.
For two-tailed tests, then, we have the problem of selecting not only a
significance level α, but also the upper and lower critical values, or α1 and α2,
for a given significance level. For the latter, like the former, except by adoption
of some convention, the usual methodology of hypothesis testing tells us only
to "look at the power functions and make a choice," but sheds no light on how
to do so. It is easier said than done.

P-valuesfor Two-Tailed Tests

For two-tailed, as for one-tailed, tests, we can avoid the problem of choosing a
significance level by reporting a P-value. Unfortunately, however, the very
definition of the P-value for two-tailed tests presents a problem equivalent to
that of choosing α1 and α2 for a given significance level. (This problem has no
counterpart for one-tailed tests.)
One possibility is to report the one-tailed P-value even for a two-tailed test,
and remark that the two-tailed P-value, while depending on what kind of two-
tailed critical region would have been formed, is presumably about twice as
large as the one-tailed P-value reported. Some people go further and claim
that P-values are not appropriate in two-sided situations, but that seems an
inappropriate dismissal of a problem which is not trivial and should be
examined.
To obtain a precise two-tailed P-value, one would have to add the prob-
ability of a value equal to or more extreme than that observed in the same tail
and some probability from the opposite tail. However, a single observed result
can give various P-values depending on what probability is added to represent
the other tail. The most common procedure is to report a two-tailed P-value as
twice the one-tailed P-value, but there are other possibilities.
To simplify the discussion, we assume that the test statistic has only one
relevant null distribution, either because only one is possible or because its
other possible null distributions have tail probabilities at least as small. When
this null distribution is symmetric, like the binomial for p = 0.5, all procedures
lead to a two-tailed P-value which is twice the one-tailed P-value.
When the null distribution is not symmetric, doubling the one-tailed
P-value corresponds to the test with conservative level α/2 in each tail, as at
(4.7) and (4.8). If the null distribution is discrete, the exact level of this pro-
cedure will ordinarily be strictly less than α (Problem 20), clouding the inter-
pretation of the P-value. A modification which avoids this problem is to define
the two-tailed P-value as the sum of the one-tailed P-value and the largest
attainable probability in the other tail which does not exceed the one-tailed
P-value. This is the exact level if the foregoing test is carried out at a nominal
level equal to twice the one-tailed P-value. It is also the exact P-value of a two-
tailed test in which the values of the test statistic are ordered according to their
one-tailed P-values (Problem 21). If instead we add the nearest attainable
probability in the other tail, whether larger or smaller than the one-tailed
P-value, two points are often added to the rejection region simultaneously,
unnecessarily reducing the number of exact levels available, and the resulting
P-values need not correspond to any test procedure (Problem 22).
Another approach might be called the "principle of minimum likelihood."
If the value S = s is observed, the P-value at s is found by summing the null
probabilities of all values of S which do not exceed the probability P(S = s).
If S has a density f under the null hypothesis, the P-value is the probability of
the region where the density is f(s) or less. This gives a two-tailed probability
as long as the null distribution is unimodal. It is equivalent to ordering the
values of S according to its mass or density function under the null hypothesis.
It corresponds to the most powerful test against the alternative that S is
uniformly distributed.
Still another possibility is to locate the cutoff of the two tails at an equal
distance from some specified location parameter μ of the null distribution,
such as the mean, median, or mode. This corresponds to ordering values of the
test statistic S according to |S - μ|. Then the P-value is P(S ≥ s) + P[S ≤
μ - (s - μ)] when s is in the upper tail. This procedure seems appealing if one
interprets the P-value as a measure of agreement or disagreement between the
observed value of the test statistic and some central value under the null
hypothesis. Unfortunately, some sort of skewness correction is needed for
severely asymmetric null distributions, just those where the choice of pro-
cedure matters most.
We illustrate these procedures in the binomial case with n = 10 and Ho:
p = 0.6. For convenience, the point probabilities for S under H0 are given in
Table 4.1.
Suppose that S = 3 is observed. The appropriate one-tailed P-value is
lower-tailed, and P0.6(S ≤ 3) = 0.055. One procedure is simply to report this,
with the comment that the two-tailed P-value is presumably about 0.110.
This is the borderline level of significance if the standard two-tailed test with
level α/2 in each tail is used. Since no upper tail probability equals 0.055, the
first procedure suggested above would add the next smaller upper-tail
probability, which is P0.6(S ≥ 9) = 0.046, and report a two-tailed P-value
of 0.055 + 0.046 = 0.101. This is the exact level of the standard two-tailed test
at the borderline nominal level 0.110. It is also the borderline level of signific-
ance of a test based on the one-tailed P-value. The nearest attainable upper-
tail probability is the next smallest in this case and hence gives the same
result.
For the minimum likelihood procedure, the values of S with probability
smaller than that of S = 3 under the null hypothesis are 0, 1, 2, 9, and 10. Hence the
P-value is P0.6(S ≤ 3) + P0.6(S ≥ 9) = 0.101 again.
Since the mean, median, and mode of the null distribution each equal 6,
locating the two tails at equal distances from any of these also gives the same
result of 0.101. Using the midrange of 5, however, gives a P-value of P0.6(S ≤ 3)
+ P0.6(S ≥ 7) = 0.055 + 0.382 = 0.437. If the observed value of S had been
in the upper tail, this procedure with the midrange would have given a two-
tailed P-value smaller than twice the one-tailed P-value in this example since
the null distribution is skewed left here.
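To make the comparison concrete, the sketch below (assuming scipy; the three function names are ours) implements three of the procedures just described for an observed value in the lower tail: doubling the one-tailed P-value, adding the largest attainable opposite-tail probability not exceeding it, and the minimum likelihood rule. For n = 10, Po = 0.6 and S = 3 they give 0.110, 0.101, and 0.101, as in the text.

```python
from scipy.stats import binom

def doubled(s, n, p0):
    """Twice the one-tailed (lower) P-value."""
    return 2 * binom.cdf(s, n, p0)

def opposite_tail_added(s, n, p0):
    """One-tailed P-value plus the largest attainable upper-tail probability not exceeding it."""
    p_low = binom.cdf(s, n, p0)
    upper = [binom.sf(k - 1, n, p0) for k in range(n + 1)]   # P(S >= k) for k = 0..n
    return p_low + max([u for u in upper if u <= p_low], default=0.0)

def minimum_likelihood(s, n, p0):
    """Sum of null probabilities of all values no more probable than the observed one."""
    probs = binom.pmf(range(n + 1), n, p0)
    return probs[probs <= probs[s]].sum()

for f in (doubled, opposite_tail_added, minimum_likelihood):
    print(f.__name__, round(f(3, 10, 0.6), 3))   # 0.110, 0.101, 0.101
```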
We mention one other procedure that could be used for statistics with a
finite range. It places an equal number of possible values in each tail if the
distribution is discrete. If the possible values are equally spaced, this also makes
the tails equal in length, and this is the procedure used for continuous
distributions also. Except for discrete distributions with unequally spaced
values, which are unusual in practice, this procedure is equivalent to locating
the tails at equal distances from the midrange, as in the example of the pre-
vious paragraph. Unfortunately, this procedure not only is restricted to
statistics with a finite range, but also, even when defined, can give absurd
results if the null distribution is highly skewed. For example, in the binomial
case with H0: p = 0.1, suppose that S = 7 is observed with n = 10. The one-
tailed P-value is then P0.1(S ≥ 7) = 0.000. When an equal number of extreme

Table 4.1 Binomial Probabilities for p = 0.6, n = 10

s             0      1      2      3      4      5      6      7      8      9      10
P0.6(S = s)   0.000  0.002  0.010  0.043  0.111  0.201  0.251  0.215  0.121  0.040  0.006
values are placed in the lower tail, the two-tailed P-value becomes P0.1(S ≤ 3)
+ P0.1(S ≥ 7) = 0.987. Even though S = 7 strongly contradicts H0, a
P-value of 0.987 would lead to the conclusion that Ho is highly "acceptable."
Intuitively, the extent to which the data contradict a null hypothesis should
not change sharply if the hypothesis is changed slightly. However, for all of
these procedures except doubling the one-tailed P-value, a slight change in the
null hypothesis can lead to a sharp change in the P-value because the P-value
is a discontinuous function of the null hypothesis (Problem 23). The authors
consider this counterintuitive property less a flaw in the methods of forming
two-tailed P-values than a symptom of the fundamental difficulty of inter-
preting P-values as measuring the support or contradiction of the null
hypothesis. The extent of contradiction depends in part on the congruence
of the data with plausible alternatives, whereas the P-value depends only on the
null distribution.
In summary, a reasonable two-tailed P-value can be obtained in most
situations by either doubling the one-tailed P-value, or adding to it the largest
attainable probability not exceeding it in the other tail. The minimum like-
lihood method may also be satisfactory. The practice of doubling the one-
tailed P-value is perhaps the most popular, but that may be more the result of
habit than a thoughtful consideration of the merits. When all is said and done,
however, we find the game of defining a precise two-tailed P-value not worth
the candle. If a single procedure is to be recommended as appropriate for two-
tailed tests based on any distribution and any outcome, we prefer reporting
the one-tailed P-value and the direction of the observed departure from the
null hypothesis. The primary basis for this recommendation is that the P-
value then retains its clearest interpretation. Further, when the one-tailed
P-value is small, the sample outcome is extreme in a particular direction and a
one-sided conclusion will usually be desired. On the other hand, if the one-
tailed P-value is moderate or not small, the null hypothesis will be "accepted"
whether it is doubled or not. In borderline cases, the appropriate conclusions
require careful thought, not blind adherence to some rule. Careful thought is
perhaps best encouraged by reporting a one-tailed P-value with suitable
commentary attached. For further discussion, see Gibbons and Pratt [1975].
The recommendation for reporting the one-tailed P-value even with a two-
tailed test is further reinforced when we consider the test procedures which
allow a greater variety of conclusions to be reached, as our next topic.

4.6 Other Conclusions in Two-Tailed Tests

If it is necessary to reach a conclusion on the basis of a two-tailed test, the two-
tailed P-value, however computed, can be used as a critical level to define the
test just as in the one-tailed case. Whether a two-tailed test is defined in this
way or by setting up a rejection region in terms of upper and lower critical
values of a test statistic, the general interpretation of tests given in Sect. 4.1
still applies. Specifically, one may conclude, at least tentatively in the absence
of other evidence, that the null hypothesis is false if it is rejected by a "reason-
able" two-tailed test, while one may draw essentially no conclusion if it is not
rejected. This two-conclusion interpretation is indeed appropriate in some
situations. For example, rejection may amount to deciding from a preliminary
experiment that further study is worthwhile. Alternatively, it may mean that a
simple model is inadequate, in circumstances where it is not necessary to
conclude how the model might be made adequate.
In many situations, however, more definite conclusions are desirable. For
instance, when the null hypothesis p = Po is rejected by a two-sided binomial
test, we might want to conclude that p < Po if S ≤ st and that p > Po if S ≥ su.
Then there would be three possible conclusions, namely p < Po, p > Po, and
"no conclusion" (which corresponds to "accepting" the null hypothesis).
Table 4.2 gives the probability of drawing each conclusion in each kind of
situation where it is erroneous. These probabilities are bounded by the one-
tailed levels α1 and α2. For instance, if p > Po, the probability of concluding
that p < Po depends on p but is less than α1. There are no entries in the third
column because "accepting" the null hypothesis is regarded as drawing no
conclusion and hence cannot be erroneous. No matter what the true situation,
a two-tailed test, with this three-conclusion interpretation, leads to an er-
roneous conclusion with probability at most α = α1 + α2, the ordinary two-
tailed significance level.
Now suppose we modify the test procedure so that instead of concluding
that p < Po when S ≤ st, we conclude that p ≤ Po. This leads to the prob-
abilities of erroneous conclusions given in Table 4.3. No matter what the true
situation, the probability of an erroneous conclusion is now at most the
larger of α1 and α2.
For example, if the two one-tailed tests each have level 0.05, Table 4.3
shows that this two-tailed test procedure will lead to an erroneous con-
clusion with probability at most 0.05 (the one-tailed level). The procedure
of Table 4.2 permits a more refined conclusion in one case, but at the cost
of increasing the bound on the probability of an erroneous conclusion to 0.10
(the two-tailed level).
If we would be just as happy to conclude that p ≤ Po as p < Po, the second
procedure would be much better than the first because of its much lower error

Table 4.2 Probabilities of Erroneous Conclusions


α1 = exact level of lower-tailed test
α2 = exact level of upper-tailed test

Observed:          S ≤ st          S ≥ su          st < S < su
Conclusion:        p < Po          p > Po          no conclusion

True Situation                                                     Total
p < Po                             < α2                            < α2
p = Po             α1              α2                              α1 + α2
p > Po             < α1                                            < α1
Table 4.3 Probabilities of Erroneous Conclusions


α1 = exact level of lower-tailed test
α2 = exact level of upper-tailed test

Observed:          S ≤ st          S ≥ su          st < S < su
Conclusion:        p ≤ Po          p > Po          no conclusion

True Situation                                                     Total
p < Po                             < α2                            < α2
p = Po                             α2                              α2
p > Po             < α1                                            < α1

rate. One might add Po to the upper-sided conclusion instead of, or in addition
to, the lower-sided, making the first two conclusions p < Po and p ≥ Po, or
(symmetrically) p ≤ Po and p ≥ Po. The probability that the procedure will
lead to an erroneous conclusion is still at most the larger of α1 and α2 in each
case, although the appropriate tables will differ somewhat from Table 4.3
(Problem 24). There is also a procedure with a similar property allowing all
five conclusions mentioned above (Problem 25).
The validity of Tables 4.2 and 4.3, and thus of the alternative interpreta-
tions of two-tailed tests considered here, follows from the fact that the prob-
ability that S ≥ su is larger when p = Po than when p < Po, and similarly in
the other tail. For all two-tailed tests which are used in practice, this fact and
consequently Tables 4.2 and 4.3 remain valid when S is replaced by the test
statistic and p by a suitable parameter θ. Thus the corresponding alternative
interpretations of two-tailed tests are always valid in practice.
In summary, when reporting a conclusion from a two-tailed test, a con-
clusion at the appropriate one-tailed level may be more descriptive of the
true probability of erroneous rejection than the two-tailed level, unless it is
clear that rejection requires the conclusion θ ≠ θ0 or one of the conclusions
θ < θ0 and θ > θ0, not θ ≤ θ0 or θ ≥ θ0. From this point of view, a one-
tailed P-value is also more descriptive even in the two-tailed test situation.
This further suggests the desirability of reporting a one-tailed P-value so that
when a definite conclusion rather than a P-value is required, the choice of the
two-tailed procedure which best fits the purposes and problem at hand is left
to the ultimate decision-maker.

5 Randomized Test Procedures

5.1 Introduction: Motivation and Examples

We have seen that discrete distributions present difficulty for carrying out a
test of hypothesis at a desired level α. The "conservative" procedure is to
determine the critical value of the test statistic so that the exact level does not
exceed the nominal level α. In reporting whether the observations are signifi-
cant, the exact level might be stated instead of, or in addition to, the nominal
level.
Consider, for example, a lower-tailed binomial test with n = 10, Po = 0.6
and α = 0.10. The "conservative" procedure has critical value sc = 3 with an
exact level of 0.055. If the rejection region could be enlarged without increas-
ing the exact level above 0.10, the power would increase. However, the next
possibility is sc = 4 and P0.6(S ≤ 4) = 0.166. If we reject when S ≤ 4, the
exact level increases to 0.166, which considerably exceeds the nominal level of
0.10. However, the conservative test is too conservative. Even if the critical
value is chosen to give the exact level nearest the nominal level, the test
remains the same. The exact level 0.055 is far smaller than we would like,
while 0.166 is far larger. What then shall we do?
There is a definite theoretical answer to this question, but unfortunately it
is not a satisfactory practical alternative because it introduces an irrelevant
random variable. This is a procedure called a randomized test. Consider again
the binomial problem, but now suppose that n = 6 and we wish to test H 0:
p ≥ 0.5 versus H1: p < 0.5, at the level α = 6/64. (This example is used because
it leads to simpler arithmetic than the previous one. The ideas are the same for
both.) For p = 0.5, n = 6, the binomial probabilities are given in Table 5.1.
We naturally plan to reject when S = O. If we reject when S = 1 as well, the
exact level is 7/64, larger than the value 6/64 selected for α. However, if we reject
only when S = 0, the exact level is only 1/64. How can we enlarge the rejection
region in order to increase the power? A good solution might appear to be to
reject the null hypothesis when S = 1, but not when S = O. This test has
greater power for most values of p < 0.5 than the test rejecting only when
S = O. This is not an appropriate solution, however, since another procedure
is clearly superior, as will be shown shortly.
The respective power functions for any p are

$$6p(1 - p)^5 \quad \text{for the test rejecting when } S = 1 \text{ only}, \qquad (5.1)$$

$$(1 - p)^6 \quad \text{for the test rejecting when } S = 0 \text{ only}. \qquad (5.2)$$

These two functions are graphed in Fig. 5.1. For p very small, a case where
rejection is especially desirable, the power of the test rejecting when S = 1 only
is smaller than for S = 0 only, and in fact decreases to zero, while the power of
the test rejecting only when S = 0 increases to one. From (5.1) and (5.2), the
power of the S = 1 only test is greater than the power of the S = 0 only test for
all p > 1/7 (Problem 26). A test which is a combination of these two tests and

Table 5.1 Binomial Probabilities for p = 0.5, n = 6

s             0      1      2      3      4      5      6
P0.5(S = s)   1/64   6/64   15/64  20/64  15/64  6/64   1/64
[Figure 5.1: P(Reject H0: p ≥ 0.5) for n = 6, plotted against p; curves shown for the test rejecting when S = 0 only, the test rejecting when S = 1 only, and the randomized test.]

has power everywhere greater than either of them would be desirable; this can
be accomplished using a randomized test procedure.
Specifically, consider a test which rejects when S = 0 is observed, rejects
with probability 5/6 when S = 1 is observed, and "accepts" otherwise. For
instance, when S = 1 we might roll a fair die, reject if the number of spots is less
than 6, and "accept" if it is equal to 6. This procedure makes the probability
of a Type I error when p = 0.5 equal to

$$(1)P_{0.5}(S = 0) + \tfrac{5}{6}P_{0.5}(S = 1) = \tfrac{1}{64} + \tfrac{5}{6} \cdot \tfrac{6}{64} = \tfrac{6}{64},$$

which is exactly the desired value.
For any p, the probability of rejection by this randomized test is

$$(1 - p)^6 + \tfrac{5}{6} \cdot 6p(1 - p)^5 = (1 - p)^5(1 + 4p).$$

This function is also plotted in Fig. 5.1 for all p. The figure shows that this
probability is smaller than 6/64 for all p > 0.5. It also shows that the randomized
test has power everywhere greater than either of the other two tests, and has
smaller probability of a Type I error than the test which rejects when S = 1
only, except at p = 0.5 where the probabilities are the same. Later we will
show that similar properties hold more generally.
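The three power functions compared in Fig. 5.1 are simple polynomials in p, so the comparison is easy to verify numerically. A sketch assuming scipy (the function names are ours), using the randomized test that rejects when S = 0 and with probability 5/6 when S = 1:

```python
from scipy.stats import binom

def power_reject_S0(p):     # rejects only when S = 0, as in (5.2)
    return binom.pmf(0, 6, p)

def power_reject_S1(p):     # rejects only when S = 1, as in (5.1)
    return binom.pmf(1, 6, p)

def power_randomized(p):    # rejects when S = 0, and with probability 5/6 when S = 1
    return binom.pmf(0, 6, p) + (5 / 6) * binom.pmf(1, 6, p)

print(power_randomized(0.5))    # 6/64 = 0.09375, the exact level at p = 0.5
for p in (0.05, 0.2, 0.4):      # the randomized test dominates both nonrandomized tests
    print(p, power_reject_S0(p), power_reject_S1(p), power_randomized(p))
```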
Whether or not one would ever use a randomized test in practice, the fact
that there is a randomized test everywhere better than the test rejecting when
S = 1 only shows that the latter test should not be used. In the next subsection,
we will explain what is meant by randomized procedures generally and why it
is useful to talk about them even though no claim is made that people do or
should carry out irrelevant randomizations.

5.2 Randomized Tests: Definitions

The basic idea behind any randomized procedure is that we decide what action
to take, or what inference to make, or how to report the results, not only on the
basis of an observed random variable X as previously, but also at least in part
on the basis of some irrelevant random experiment. When we observe X = x,
we may reject the null hypothesis, or we may not. We may even decide at
random what to do. That is, we may reject the null hypothesis with probability
φ(x), say, and "accept" it otherwise, where φ(x) may be any value, 0 ≤ φ(x)
≤ 1. This kind of procedure is called a randomized test, and φ(x) is its critical
function, as already defined in Sect. 4.1. If such a test were carried out re-
peatedly, in the long run the null hypothesis would be rejected in a proportion
φ(x) of those cases in which the observation is x. The randomized test dis-
cussed in Sect. 5.1 which rejected always when S = 0 and with probability
5/6 when S = 1, and "accepted" otherwise, is given by φ(0) = 1, φ(1) = 5/6, and
φ(s) = 0 for s ≥ 2.
Ordinary (nonrandomized) tests are equivalent to randomized tests for
which φ(x) takes on only the values 0 and 1. Specifically, the nonrandomized
test with rejection region R is given by φ(x) = 1 for x in a region R and φ(x)
= 0 otherwise. Thus we reject the null hypothesis with probability 1 for all
x ∈ R and we "accept" it with probability 1 for all x ∉ R.
The randomization necessary to perform a randomized test could be
carried out by drawing a random variable U (independent of X) from the
uniform distribution between 0 and 1 and rejecting the null hypothesis if
U ≤ φ(x) but not otherwise. Such a U may be obtained, to any desired
accuracy, from a table of random digits or generated by a computer. Since the
event U ≤ φ(x) has probability φ(x), this procedure rejects the null hypo-
thesis with probability φ(x) when x is observed. We note incidentally that this
makes any randomized test based on X equivalent to a nonrandomized
test based on (X, U). Instead of carrying out the randomization, one might
report the value of ¢(x) for the x observed.
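A short sketch of this mechanism, with illustrative names of our own choosing, may make the description above concrete: draw U uniformly on (0, 1) and reject whenever U ≤ φ(x).

```python
# Sketch of carrying out a randomized test with critical function phi:
# draw U ~ Uniform(0, 1) independent of the data and reject when U <= phi(x).
import random

def phi(s):
    """Critical function of the randomized test of Sect. 5.1 (n = 6)."""
    if s == 0:
        return 1.0
    if s == 1:
        return 5 / 6
    return 0.0

def randomized_decision(s, rng=random.random):
    u = rng()                    # the irrelevant randomization
    return u <= phi(s)           # True = reject, False = "accept"

# Over many repetitions with S = 1, the null hypothesis is rejected in
# about 5/6 of the cases, as described above.
rejections = sum(randomized_decision(1) for _ in range(100000))
print(rejections / 100000)
```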
In the general binomial problem, suppose that S is the number of successes
in n independent Bernoulli trials with probability p of success on each trial.

A lower-tailed randomized test is to reject always if S < s_l and reject with
probability φ_l if S = s_l. The exact level of such a test for H₀: p = p₀ is

α₁ = P_{p₀}(S < s_l) + φ_l P_{p₀}(S = s_l).     (5.3)

For any α₁ (0 ≤ α₁ ≤ 1) there is (Problem 28) exactly one lower-tailed
randomized test with exact level α₁. (All tests with the same critical function
are considered the same, since they can differ only in the method of carrying
out the randomization.) Similarly, an upper-tailed randomized test is to reject
always if S > s_u and to reject with probability φ_u if S = s_u; its exact level is

α₂ = P_{p₀}(S > s_u) + φ_u P_{p₀}(S = s_u),     (5.4)

and α₂ determines the test (critical function) uniquely.
A two-tailed randomized test is the combination of an upper-tailed and a
lower-tailed randomized test and rejects if either of the one-tailed tests rejects
but not otherwise. That is, a two-tailed randomized test is to reject always if
S < s_l or S > s_u, reject with probability φ_l if S = s_l and with probability φ_u
if S = s_u, and "accept" otherwise.³ The exact level α of such a test for H₀:
p = p₀ is the sum of the exact levels α₁ and α₂ above, or

α = P_{p₀}(S < s_l) + φ_l P_{p₀}(S = s_l) + P_{p₀}(S > s_u) + φ_u P_{p₀}(S = s_u).     (5.5)

There is an infinite number of two-tailed tests at a given exact level α. For
each α₁, 0 ≤ α₁ ≤ α, there is one given by the lower-tailed test at exact level α₁
and the upper-tailed test at exact level α₂ = α - α₁. The difficulty of choosing
among them was pointed out in Section 4.5.
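To make (5.3) and the uniqueness claim concrete, the following sketch (our own code, with hypothetical function names) finds, for a given exact level α₁, the cutoff s_l and the randomization probability φ_l of the lower-tailed randomized test.

```python
# Sketch: the unique lower-tailed randomized test (s_l, phi_l) with exact
# level alpha1 for H0: p = p0, following (5.3).  Increase s_l until the
# cumulative null probability would reach alpha1, then randomize at s_l.
from math import comb

def binom_pmf(s, n, p):
    return comb(n, s) * p**s * (1 - p)**(n - s)

def lower_tailed_test(alpha1, n, p0):
    cum = 0.0                                  # P(S < s_l) under p0
    for s_l in range(n + 1):
        pmf = binom_pmf(s_l, n, p0)
        if cum + pmf >= alpha1:
            phi_l = (alpha1 - cum) / pmf       # randomize only at S = s_l
            return s_l, phi_l
        cum += pmf
    return n, 1.0                              # alpha1 = 1: always reject

print(lower_tailed_test(0.01, 6, 0.5))         # (0, 0.64), cf. Sect. 5.4
print(lower_tailed_test(6 / 64, 6, 0.5))       # (1, 5/6), the test of Sect. 5.1
```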

5.3 Nonrandomized Tests Equivalent to Randomized Tests

Some nonrandomized tests amount in a sense to randomized tests. In our


numerical example where n = 6, p₀ = 0.5, α = 6/64, suppose now that the
original Bernoulli trials are distinguishable (say by the order in which they
were made) and that the individual outcomes are available. Consider the test
that rejects the null hypothesis if there is no success or if there is just one
success and it occurs after the first trial. This test has level exactly 6/64, and in
fact it has the same conditional probability of rejection given S (and hence the
same unconditional probability of rejection) as the randomized test above, for
every p (Problem 29). This relates to the fact (see below) that S is a sufficient
statistic. Even though this test is nonrandomized with respect to the original
observations, its behavior for all p can be duplicated by a randomized test
based on S, and hence it is just as reasonable or unreasonable. Rejecting if
there is a single success and it is not on the first trial amounts to rejecting with
probability 5/6 whenever there is a single success, where the trial number of the
success is used to carry out the randomization.

³ We assume that s_l ≤ s_u, and φ_l + φ_u ≤ 1 if s_l = s_u, so that the upper and lower tails are
mutually exclusive.

This type of duplication was discussed further in connection with sufficiency


in Sect. 3.3. We showed there how any procedure based on the original
observations can be duplicated by a procedure based on the sufficient statistic
S. Randomized procedures may be required, however. No nonrandomized
procedure based on S duplicates the foregoing nonrandomized procedure
based on the original observations and having level exactly 6/64. If randomized
procedures were not permitted even theoretically, it might be necessary to
look at the original observations rather than the sufficient statistic, not
because the original observations provide more information, but simply
because their distribution is finer and allows randomization to be smuggled in.

5.4 Usefulness of Randomized Tests in Theory and Practice

Randomization removes all theoretical problems stemming from dis-
creteness. It permits tests to be made at any arbitrary level α. Suppose we want
a lower-tailed test at exact level 0.01 in the binomial problem with n = 6 and
p₀ = 0.5. Then we should reject with probability 0.64 when S = 0. The level
of a nonrandomized test that rejects for S = 0 is 1/64. The only nonrandomized
procedure at a level less than 1/64 is that which never rejects the null hypothesis.
This is true even if the original observations are available.
"Optimum" tests in discrete problems are usually randomized. For
example, the randomized test in Sect. 5.3 at level 6/64 is the best test at its level,
in the sense that for every P < 0.5, its power is at least as great as that of any
other test, randomized or not, based on the same observations. (This will be
shown in Sect. 7). Accordingly, if we do not want to use randomized tests, the
test used might be regarded as an approximation to this best randomized
test. Then we will presumably choose between rejecting only for S = 0,
giving exact level 1/64, and rejecting for S = 0 or 1, giving much greater power
but exact level 7/64, which is larger than we wanted. In any case, only one-tailed
tests need be considered, and not, for example, the test rejecting only when
S = 1 but not when S = 0. Without randomized tests, this is much more
difficult to see.
One can argue that there is certainly no harm in introducing randomized
procedures. If they are inferior, the theory should show that. They serve many
useful purposes, as we have already seen. They show clearly that some pro-
cedures should be excluded from consideration. They justify restricting
consideration to procedures which are based on a sufficient statistic, and they
clarify the nature of procedures which are not (such as the one that rejects for
no successes or one success not on the first trial). They permit a simple and
reasonable formulation of theoretical problems (such as "What is the best
test at level α of the null hypothesis p ≥ 0.5?"). The solutions of these
theoretical problems provide insight about the choice of procedures, and
about the limitations of this kind of theory for choosing procedures. Further,
randomized procedures are useful for power comparisons of tests which

cannot be performed at the same level using nonrandomized tests. As


mentioned in Sect. 4.2, power comparisons of tests may be uninterpretable,
or at least misleading, unless the exact levels are the same.
The problem of what to do in practice still remains. It does not seem that
one should be required to use a randomized procedure in a practical statistical
problem, although if two procedures are equally good one might not object
to letting an irrelevant randomization choose between them. (Reasons can be
given for requiring randomized procedures, for example, in a game against
an opponent or in writing quality control contracts, where a specific α may
be desirable. However, these reasons do not apply to ordinary statistical
problems.) Even when only a randomized test will provide the desired level,
the authors do not recommend that a randomized procedure be actually
performed. To do so without explanation would be a deception, and once it is
explained, any reader can easily carry it out if he thinks it useful. The process
of performing a randomization does not add to the information provided by
the observations.
In the case of one-tailed tests, where the P-value is uniquely defined,
following the procedure of reporting the P-value essentially eliminates the
problem, although it remains under the rug in that a discrete P-value is
implicitly conservative. For two-tailed tests, the problems discussed in
Sect. 4.5 are relevant. Reporting the one-tailed P-value, but clearly labeled as
such, is probably the simplest solution in practice. In either case, one might
also report the next P-value, the tail probability beyond the observed value.
In the binomial example above with n = 6 and H₀: p ≥ 0.5 (Table 5.1),
suppose that S = 1. The one-tailed test results could be reported by giving the
exact P-value 7/64 = 0.109 and perhaps also the next P-value 1/64 = 0.016. That
is, one could report significance at the exact level 0.109 and perhaps also that
the next more extreme result would have been significant at the exact level
0.016. If the test were two-tailed for the null hypothesis p = 0.5 and the
alternative p ≠ 0.5, the same P-values could be reported but labeled as one-
tailed.
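The two reported quantities are simple binomial tail probabilities, as the following small sketch (our own code) confirms for this example.

```python
# Sketch: exact and "next" one-tailed P-values for the lower-tailed binomial
# test with n = 6, H0: p >= 0.5, when S = 1 is observed.
from math import comb

def binom_cdf(s, n, p):
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(s + 1))

n, p0, s_obs = 6, 0.5, 1
exact_p = binom_cdf(s_obs, n, p0)         # P(S <= 1) = 7/64 = 0.109
next_p = binom_cdf(s_obs - 1, n, p0)      # P(S <= 0) = 1/64 = 0.016
print(round(exact_p, 3), round(next_p, 3))
```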

*5.5 Randomized P-values

Extending the earlier definition of the P-value to randomized tests gives what
we shall call the randomized P-value, which is uniformly distributed between
the exact P-value and the next P-value (Problem 32). If the P-value is to
measure the extent to which the data support the null hypothesis (see Sect. 4.4
for difficulties with this interpretation), a single number is presumably
desired for each possible outcome. The mid-P-value is suggested by the fact
that the distribution of the randomized P-value is symmetric about it (and
in particular has mean and median equal to it). The observations are sig-
nificant at nominal levels above this value and not significant at nominal levels
below it if the test chosen at nominal level α is the nonrandomized test having

greatest probability of agreeing with the randomized test at exact level α, or,
as mentioned in Sect. 4.4, that with exact level nearest to the nominal level.
This does not alter the recommendation made above to report the exact
P-value as well as the next P-value. Anyone who thinks that the mid-P-value
is of special interest can then compute it. *
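For the example of the previous subsection (n = 6, S = 1), the mid-P-value is easy to compute; the sketch below (our own helper names) shows that it equals both P(S < s) + ½P(S = s) and the midpoint of the exact and next P-values.

```python
# Sketch: the mid-P-value lies halfway between the exact and next P-values.
from math import comb

def binom_pmf(s, n, p):
    return comb(n, s) * p**s * (1 - p)**(n - s)

def binom_cdf(s, n, p):
    return sum(binom_pmf(k, n, p) for k in range(s + 1))

n, p0, s_obs = 6, 0.5, 1
exact_p = binom_cdf(s_obs, n, p0)              # 7/64
next_p = binom_cdf(s_obs - 1, n, p0)           # 1/64
mid_p = next_p + 0.5 * binom_pmf(s_obs, n, p0)
print(mid_p, (exact_p + next_p) / 2)           # both 4/64 = 0.0625
```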

6 Confidence Regions
We now introduce the concept of confidence regions. This form of inference,
like estimation, refers to any parameter value, not just a preselected one, yet
also, like a significance test, provides an exact statement of error probability.
We shall lead up to confidence regions by way of tests to remove the mystery
of their construction and to emphasize the intimate relationship between the
two concepts.

6.1 Definition and Construction in the Binomial Case

Consider again the situation of n Bernoulli trials with unknown probability


p of success on each trial. We have seen how to perform a one- or two-sided
test at level α for the simple null hypothesis H₀: p = p₀. Suppose now that we
have no reason to choose a particular value of p₀ to be tested in H₀. Then we
might prefer to determine for each value p₀ whether or not the null hypothesis
p = p₀ would be rejected. This procedure will give the set of values of p₀ which
would be "accepted" if a (nonrandomized) one- or two-sided test were
performed at fixed level α. For standard binomial tests this set turns out to be
an interval of values for p₀ (which may extend to 0 or 1). In terminology to be
introduced formally later, this interval is called a confidence region or interval
with confidence level 1 - α.
Suppose, for example, that n = 10 and α = 0.05. If S = 3 successes are
observed, a lower-tailed test of H₀: p = p₀ would call for rejection of H₀ if
P_{p₀}(S ≤ 3) ≤ 0.05 and not otherwise. We find (for instance, by interpolation
in extensive tables of the binomial distribution such as those published by the
National Bureau of Standards [1949] or Harvard University [1955]) that

P_{0.607}(S ≤ 3) = 0.05
P_{p₀}(S ≤ 3) < 0.05 for p₀ > 0.607
P_{p₀}(S ≤ 3) > 0.05 for p₀ < 0.607.

The decision would therefore be to reject H₀: p = p₀ for all p₀ ≥ 0.607 and to
"accept" it for all p₀ < 0.607.
More generally, if we observe s successes in n trials, the null hypothesis
p = p₀ would be rejected by a lower-tailed test at the level α if the lower tail
probability satisfies P_{p₀}(S ≤ s) ≤ α, and would be "accepted" otherwise.

Since, for given s, this probability is a strictly decreasing function of p₀
(Problem 13), there exists a value p_u, which depends on s, such that

P_{p₀}(S ≤ s) = α for p₀ = p_u,  (6.1)
P_{p₀}(S ≤ s) < α for p₀ > p_u,  (6.2)
P_{p₀}(S ≤ s) > α for p₀ < p_u.  (6.3)

The lower-tailed test at level α would then reject H₀: p = p₀ for all p₀ ≥ p_u,
and would "accept" it for all p₀ < p_u.
We now change the emphasis slightly. Suppose that p = p₀. Notice that p_u
is a random variable; it is a function of the random variable S, and is some-
times written p_u(S) to emphasize this fact. Since p_u ≤ p₀ if and only if H₀:
p = p₀ is rejected at level α, the probability that p_u ≤ p₀ is the probability of
rejection, which is at most α. Similarly, the probability of the complementary
event p_u > p₀ is at least 1 - α. These statements hold for any value of p₀.
In short,

P_{p₀}[p_u(S) ≥ p₀] ≥ 1 - α for all p₀.  (6.4)

When this property holds, we call the random variable p_u(S) an upper
confidence limit (or upper confidence bound) for p at confidence level 1 - α.
This limit was obtained from a lower-tailed test. The confidence property says
that, whatever the true value p₀ may be, before S is observed, the probability is
at least 1 - α that the upper confidence limit p_u(S) will be at least p₀.
Similarly, a lower confidence limit (or lower confidence bound) for p at
confidence level 1 - α is defined as a random variable p_l(S) such that

P_{p₀}[p_l(S) ≤ p₀] ≥ 1 - α for all p₀.  (6.5)

This lower confidence limit corresponds to an upper-tailed test at level α.
For any S, a lower confidence limit and an upper confidence limit for p,
each at level 1 - α, when taken together, are said to form a two-sided confi-
dence interval for p at confidence level 1 - 2α because if p_l(S) and p_u(S) satisfy
(6.5) and (6.4) respectively, then

P_{p₀}[p_l(S) ≤ p₀ ≤ p_u(S)] ≥ 1 - 2α for all p₀.  (6.6)
As an example, Table 6.1 shows the lower and upper 90% confidence limits
for p, corresponding to upper-tailed and lower-tailed tests respectively, each at
level 0.10, for n = 5 (Problem 35).

Table 6.1 Lower and Upper 90% Binomial Confidence Limits for n = 5

s        0      1      2      3      4      5
p_l(s)   0.000  0.021  0.112  0.247  0.416  0.631
p_u(s)   0.369  0.584  0.753  0.888  0.979  1.000

The upper confidence limits are found
following a procedure analogous to (6.1)-(6.3), for each possible value of S.
The lower limits are found similarly. Notice that half of the values in Table 6.1
can be obtained by subtraction, since p_l(s) = 1 - p_u(n - s) for any s (Prob-
lem 37). This example will be discussed further in Section 6.4.
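The entries of Table 6.1 can be reproduced by solving the defining tail-probability equations numerically. The sketch below (our own code; the bisection routine is an assumption of convenience, not part of the text) does this for n = 5 at one-sided level 0.10.

```python
# Sketch reproducing Table 6.1 (n = 5, one-sided level 0.10 each side): the
# upper limit p_u(s) solves P_p(S <= s) = 0.10 and the lower limit p_l(s)
# solves P_p(S >= s) = 0.10, found by bisection.
from math import comb

def cdf(s, n, p):
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(s + 1))

def solve(f, lo=0.0, hi=1.0, tol=1e-10):
    """Root of a decreasing function f on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return (lo + hi) / 2

n, alpha = 5, 0.10
for s in range(n + 1):
    p_u = 1.0 if s == n else solve(lambda p: cdf(s, n, p) - alpha)
    p_l = 0.0 if s == 0 else solve(lambda p: alpha - (1 - cdf(s - 1, n, p)))
    print(s, round(p_l, 3), round(p_u, 3))
```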
The results in Table 6.1 are plotted as points on two curves in Fig. 6.1. The
principle of construction for this graph can be extended to any sample size;
this has been done by Clopper and Pearson [1934] to produce the well-known
Clopper-Pearson charts. The chart for confidence level 0.95 is reproduced as
Fig. 6.2 in order to illustrate the format. These charts provide a convenient
method for finding upper, lower, or two-sided confidence limits in binomial
problems. They also provide a graphic version of the foregoing derivation.
Consider the region between the curves for a given n. The horizontal sections
are the values of s/n for which each given p would be "accepted." The vertical
sections are the values of p which would be "accepted" for a given s/n. The
vertical section corresponding to the observed s/n covers the true p if and only
if the horizontal section corresponding to the true p covers the observed s/n.
The relations between tests and confidence regions and between their error
probabilities follow.
Though slightly less convenient than graphs, Table C is compact and
allows greater accuracy in finding binomial confidence limits. It includes five
common levels α in the range 0.005 ≤ α ≤ 0.100. For each s, for s/n < 0.50,
the tabulated values are n times the confidence limits, and hence are simply
divided by n to obtain the confidence limits. For s/n > 0.50, Table C is
entered with 1 - (s/n) and lower and upper are interchanged; the correspond-
ing entries are then divided by n and subtracted from 1 to find the confidence
limits. The values s = 0 and s = n are special cases, as explained in the table.
To illustrate the use of Table C, consider n = 5 and s = 2. Then s/n = 0.40,
and the table entries for α = 0.10 are 0.561 and 3.77. These numbers are

Figure 6.1 Lower and upper 90% confidence limits for p when n = 5.

Figure 6.2 Chart providing confidence limits for p in binomial sampling, given a
sample fraction Y/n, confidence coefficient 1 - 2α = 0.95. The numbers printed along
the curves indicate the sample size n. If for a given value of the abscissa Y/n, L and U
are the ordinates read from (or interpolated between) the appropriate lower and upper
curves, then P(L ≤ p ≤ U) ≥ 1 - 2α. (Adapted from Table 41, pp. 204-205, of E. S.
Pearson and H. O. Hartley, Eds. (1962), Biometrika Tables for Statisticians, Vol. I,
Cambridge University Press, Cambridge, with permission of the Biometrika Trustees.)

divided by 5 to get 0.112 and 0.754 as lower and upper limits, which agree
(except for rounding) with Table 6.1.
Binomial confidence limits are quantiles of the beta distribution. They
are also ratios of linear functions of quantiles of the F distribution. The
approximation at the end of Table C results from a transformation of the F
distribution derived from the cube root transformation of the chi-square
distribution (see Wilson and Hilferty [1931], Paulson [1942], Camp [1951]
and Pratt [1968]).
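The beta-quantile form of the limits is convenient in software. The sketch below (our own code, assuming scipy is available) uses the standard identification of the one-sided limits with beta quantiles and reproduces Table 6.1.

```python
# Sketch of the beta-quantile form of the binomial confidence limits mentioned
# above (one-sided level alpha on each side).
from scipy.stats import beta

def clopper_pearson(s, n, alpha=0.10):
    lower = 0.0 if s == 0 else beta.ppf(alpha, s, n - s + 1)
    upper = 1.0 if s == n else beta.ppf(1 - alpha, s + 1, n - s)
    return lower, upper

# Reproduces Table 6.1 (n = 5, 90% limits); e.g. s = 2 gives about (0.112, 0.753).
for s in range(6):
    print(s, clopper_pearson(s, 5))
```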

6.2 Definition of Confidence Regions and Relationship to Tests in


the General Case

The concept and derivation of confidence limits explained above for binomial
problems generalize directly to confidence regions for any parameter θ in any
situation. Suppose that we have a test based on any random variable S of the
null hypothesis θ = θ₀, for each value of θ₀ and level α. For any fixed θ₀ and α,
the test will reject for certain values of S and not for others. The set of values of
S for which the null hypothesis θ = θ₀ would be "accepted" will be denoted by
A(θ₀). Once S = s is observed, we are interested in the converse question: for
which values of θ₀ would the null hypothesis θ = θ₀ be "accepted?" The
set of all such values of θ₀ is a region C(s), defined by

θ₀ ∈ C(s) if and only if s ∈ A(θ₀).  (6.7)

If the test of the null hypothesis θ = θ₀ has level α, then the probability of
"accepting" θ = θ₀ when it is true is at least 1 - α, that is,

P_{θ₀}[S ∈ A(θ₀)] ≥ 1 - α.  (6.8)

(If θ = θ₀ allows more than one distribution of S, this holds for all of them, and
similarly hereafter.) But the event S ∈ A(θ₀) is, by the definition (6.7) of C,
equivalent to the event θ₀ ∈ C(S). Substituting this equivalence in (6.8), we
have

P_{θ₀}[θ₀ ∈ C(S)] ≥ 1 - α.  (6.9)

If, for each θ₀, the test of the null hypothesis θ = θ₀ is a test at level α, then
(6.9) holds for every θ₀. This is the defining condition for a confidence region
C(S) at confidence level 1 - α. We see that a test at level α for each θ₀ leads
immediately to a corresponding confidence region at confidence level 1 - α.
The left-hand side of (6.9) is called the true confidence level and is discussed in
Sect. 6.4. The exact confidence level is the maximum value of 1 - α such that
(6.9) holds for all possible distributions of S. This is the minimum (or infimum)
of the left-hand side, the probability that C(S) includes the true parameter
value, over all possible distributions. Nominal and conservative levels are
defined in the obvious way.
Conversely, if one has a confidence region C(S) at confidence level 1 - α,
then for each θ₀ a test of the null hypothesis θ = θ₀ at level α may be performed
by "accepting" if the confidence region C(S) includes θ₀ and rejecting other-
wise. This is equivalent to defining the "acceptance" region A(θ₀) by (6.7),
"accepting" the null hypothesis θ = θ₀ if S ∈ A(θ₀) and rejecting it otherwise.
Thus there is an exact correspondence between a confidence region for θ at
confidence level 1 - α and a family of tests of null hypotheses θ = θ₀, each
test at significance level α. It is conventional, if not particularly convenient, to
measure significance levels as (Type I) error probabilities and confidence

levels as 1 minus the error probability. Typical significance levels are 0.10,
0.05, 0.01, and the corresponding typical confidence levels are 0.90, 0.95, 0.99.
We generally adhere to this convention, although context would determine
the meaning anyway. Notice that the definition of a confidence region, that
(6.9) holds for all θ₀, makes no reference to hypothesis testing. Nevertheless,
the relationship is so intimate that it should never be lost sight of.
For the parameter p of the binomial distribution, the confidence regions
defined above agree with the confidence limits derived in Sect. 6.1. Specifically,
the confidence region corresponding to the family of lower-tailed tests at
level α is the interval p < p_u, where p_u is the upper confidence limit for p at
level 1 - α, and similarly for upper-tailed and two-tailed tests. Verification
beyond what has already been given is left to the reader (Problem 40). Some
confidence procedures for the binomial parameter which correspond to other
two-tailed tests are discussed later, in Sect. 8.2.
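The construction (6.7) can be carried out mechanically by scanning a grid of null values and keeping those the test "accepts." The sketch below (our own code; the grid approach is an assumption of convenience) does this for the lower-tailed binomial test, recovering the interval found in Sect. 6.1.

```python
# Sketch of the correspondence (6.7): keep those p0 that the level-alpha
# lower-tailed binomial test "accepts" when S = s is observed.
from math import comb

def cdf(s, n, p):
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(s + 1))

def confidence_region(s, n, alpha, grid_size=2000):
    accepted = []
    for i in range(grid_size + 1):
        p0 = i / grid_size
        if cdf(s, n, p0) > alpha:      # lower-tailed test "accepts" p0
            accepted.append(p0)
    return min(accepted), max(accepted)

print(confidence_region(3, 10, 0.05))  # roughly (0.0, 0.607), cf. Sect. 6.1
```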

6.3 Interpretation of Confidence Regions

The random region C(S) must be distinguished from the particular region
C(s) obtained for the particular value s of S. C(S) is a random variable,
because it is a function of a random variable. The distinction between C(S)
and C(s) is another case of the distinction between a random variable and its
value, like that between estimator and estimate in Sect. 3. Unfortunately, the
term confidence region is standard for both the random variable and its value,
and it must be determined from the context which is meant. Intuitively, a
confidence region is random before the observations are made, but not
afterwards.
If C(S) is a confidence region at confidence level 1 - α, then, by definition,
the event θ₀ ∈ C(S) has probability at least 1 - α whatever the true value θ₀.
However, once a value S = s is observed, the probability that the true value θ₀
lies in C(s) is not necessarily 1 - α. In the "frequentist" framework of
probability, one can say only that the probability that the true value lies in C(s)
is unknown, but must be either 0 or 1. (A similar point was made in Sect. 4.2 in
connection with hypothesis tests.) The confidence level is a property of the
confidence procedure, not of the confidence region for a particular outcome.
One interpretation of a confidence region is as those values which cannot
be rejected, with a testing interpretation of rejection. Some people are careful
to limit themselves to this interpretation. Most, however, probably go some-
what further in allowing the connotations of the word "confidence" to enter
their thinking, without necessarily claiming any very strong justification for
doing so.
The Bayesian framework sometimes justifies confidence in observed
confidence regions. When probabilities are used in a Bayesian framework to
represent "degrees of belief" (see Sect. 4.2), prior belief may have much less

influence than the observations. In this case, if an "appropriate" confidence
procedure is used, then the probability, after the observations are made, that
the true value is included in the confidence region is approximately equal to
the confidence level 1 - α. (Here the true value is the random variable, while
the confidence region is fixed by the observations.) A rather similar statement
is that, in the absence of any information other than that provided by the
data, after seeing the observations and computing a confidence region at
level 1 - α, approximately fair odds for a bet that the region includes the true
value would be 1 - α to α. This statement can be made precise enough to judge
its truth only by using probabilities for degrees of belief. "Absence of informa-
tion" is especially difficult to represent mathematically.
The defining property of a confidence procedure does not in itself justify
any such interpretation after the observations have been made, and indeed
there are confidence procedures for which such interpretations are demon-
strably unreasonable. However, when little information is available a priori,
we can say that such interpretations "generally" hold "approximately" for
"appropriate" procedures in most problems. Regardless of statistical phil-
osophy, one would not ordinarily make a confidence statement if one had
reason to doubt the appropriateness of the confidence level as an approximate
measure of the confidence to be placed in it after seeing the observations.
The foregoing discussion provides a very natural "approximately
Bayesian" way to interpret confidence regions. No such simple interpretation
of tests or P-values is possible in general. Further, a confidence region conveys
greater information than a test, since it is equivalent to a family of tests.
These two facts make the reporting of confidence regions a desirable alternative
to, or addition to, reporting test results whenever possible. Typically, a
confidence region for a parameter is similar to an estimate plus or minus a
measure of uncertainty, but it is statistically more refined. It provides insight
into both the magnitude of the parameter and the reliability of the available
information about it, measured in terms of precise, readily apprehended
statistical properties. On the other hand, the most conclusive result possible
from a test is rejection of a possibly "straw man" null hypothesis.
One caveat we must consider is that if the probability that θ₀ ∈ C(S), say
1 - α(θ₀), is not a constant 1 - α for all θ₀, the interpretations above may
need modification. In particular, they will more nearly hold for some kind of
an average of α(θ₀) than for a conservative nominal level or the exact level
defined above as one minus the maximum of α(θ₀). This distinction is not
always negligible in practice. For instance, Fig. 6.3 shows, for n = 5, the
probability 1 - α(p₀) that the ordinary two-sided binomial confidence
interval at nominal level 0.80 will include the true value p₀, as a function of p₀.
The exact level is 0.83, but 1 - α(p₀) is above 0.90 over most of the range of p₀,
so both 0.80 and 0.83 seem to understate considerably the confidence one can
have in the interval. We use a small n and large α here for simplicity, but the
error rate would be overstated by a similar factor for smaller α and larger n.

Figure 6.3 True confidence level for 80% confidence intervals for p when n = 5.

6.4 True Confidence Level

It is instructive to examine in some detail the calculation of the probability of


including the true p as a function of p in a binomial example. Recall that the
values given in Table 6.1 for lower and upper 90% binomial confidence limits
on p were each constructed to correspond to a one-tailed test at nominal level
0.10. The exact level of the test is a function of p. Hence so is the true level of the
confidence procedure, where, for any value of p, the true confidence level here
means the probability (computed under p) that the confidence region will
include p. For instance, looking at Table 6.1, we see that for p = 0.4, the two-
sided confidence interval will include p if S = 1, 2, or 3, but not otherwise. The
true confidence level then is the sum

∑_{s=1}^{3} P_{0.4}(S = s) = P_{0.4}(S ≤ 3) - P_{0.4}(S = 0) = 0.8352

from Table B. This can be calculated for any p, and Fig. 6.3 shows a graph of
the result.
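The same calculation is easy to automate. The sketch below (our own code, with the limits taken from Table 6.1) computes the true confidence level 1 - α(p) for any p and reproduces the value 0.8352 at p = 0.4 as well as entries of Table 6.2.

```python
# Sketch computing the true confidence level 1 - alpha(p) of the Table 6.1
# intervals for any p, as plotted in Fig. 6.3: sum P_p(S = s) over those s
# whose interval [p_l(s), p_u(s)] covers p.
from math import comb

p_l = [0.000, 0.021, 0.112, 0.247, 0.416, 0.631]
p_u = [0.369, 0.584, 0.753, 0.888, 0.979, 1.000]

def true_level(p, n=5):
    pmf = lambda s: comb(n, s) * p**s * (1 - p)**(n - s)
    return sum(pmf(s) for s in range(n + 1) if p_l[s] <= p <= p_u[s])

print(round(true_level(0.4), 4))            # 0.8352, as computed above
for p in (0.05, 0.10, 0.20, 0.30, 0.50):
    print(p, round(true_level(p), 3))       # compare with Table 6.2
```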
Numerical values for selected p are shown in Table 6.2 under the heading
True Level. The true, two-sided confidence level can also be computed in two
pieces as follows. The probability that the interval includes p is one minus the
probability that it does not include p, and the latter will occur if and only if p is
either (a) smaller than the lower confidence limit, or (b) larger than the upper
confidence limit. But p is smaller than the lower limit if and only if, for that p,
the null hypothesis would be rejected by an upper-tailed test. This occurs if the
observed s satisfies P_p(S ≥ s) ≤ 0.10, that is, if s is in the upper 0.10 tail of the
distribution corresponding to p. Hence the probability of (a) is the largest
upper tail probability, not exceeding 0.10, of the distribution corresponding
to p. For p = 0.4, n = 5, for instance, we see in Table 6.1 that p is smaller than

Table 6.2 True Confidence Level for Confidence Intervals Corresponding
to Equal-Tailed Binomial Tests at Nominal Level 0.20 when n = 5

p            0      .021-  .021+  .050   .100   .112-  .112+  .200   .247-
Lower^a      0.000  .000   .000   .000   .000   .000   .000   .000   .000
Upper^b      0.000  .100   .004   .023   .081   .100   .012   .058   .100
Sum          0.000  .100   .004   .023   .081   .100   .012   .058   .100
True Level   1.000  .900   .996   .977   .919   .900   .988   .942   .900

p            .247+  .300   .369-  .369+  .400   .416-  .416+  .450   .500
Lower^a      0.000  .000   .000   .100   .078   .069   .069   .050   .031
Upper^b      0.015  .031   .065   .065   .087   .100   .012   .018   .031
Sum          0.015  .031   .065   .165   .165   .169   .081   .068   .062
True Level   0.985  .969   .935   .835   .835   .831   .919   .932   .938

^a Probability upper confidence limit is smaller than p (equals largest lower tail probability not
exceeding 0.10).
^b Probability lower confidence limit is larger than p (equals largest upper tail probability not
exceeding 0.10).

the lower confidence limit if and only if the observed s is 4 or 5, which has
probability P_{0.4}(S ≥ 4) = 1 - 0.9130 = 0.0870. This is indeed the largest
upper binomial tail probability for p = 0.4 which does not exceed 0.10.
Similarly, the probability of (b) is the largest lower binomial tail probability
which does not exceed 0.10. For p = 0.4, this is P_{0.4}(S ≤ 0) = 0.0778. Since
(a) and (b) are mutually exclusive events, the true two-sided confidence level
is 1 - (0.0870 + 0.0778) = 0.8352, as found in the previous paragraph.
Similar results for other values of p ≤ 0.5 are also shown in Table 6.2 under
the headings Upper and Lower respectively. The entries for values of p not
included in Table B are found in the same way, but using more extensive
tables of the binomial distribution or interpolation or a computer. Since the
true level is not a continuous function of p, it is necessary to give special
attention to the points of discontinuity, which are the possible confidence
limits. A similar table for p > 0.5 can be obtained by changing the label p to
1 - p and interchanging the labels Lower and Upper in Table 6.2.

6.5 Including False Values and the Size of Confidence Regions

In this subsection we show that if a family of tests is powerful, the correspond-


ing confidence regions will have high probability of excluding false values, and
will also be small according to natural measures of size.⁴ The exact sense in

⁴ As used here, the word "size" is to be interpreted in a nontechnical sense, not to be confused
with the previous technical definition as the level of a test. Since the technical term size is not used
in this book, no difficulty in interpretation should occur.

which this is true is complicated by the fact that different tests may be powerful
against different alternatives, and, correspondingly, different confidence
regions have high probability of excluding false values and are of small size
under different hypotheses. These ideas will be discussed briefly below.
Further explanation is given by Pratt [1961] and Madansky [1962].
Suppose C(s) is a confidence procedure corresponding to a family of tests
with "acceptance" regions A(θ₀). Since θ₀ ∈ C(S) if and only if S ∈ A(θ₀), we
have

P_{θ₁}[θ₀ ∈ C(S)] = P_{θ₁}[S ∈ A(θ₀)] for all θ₁.  (6.10)

For θ₁ = θ₀, this says that the probability of "accepting" the null hypothesis
θ = θ₀ when it is true is equal to the probability that the confidence region
will include the value θ₀ when θ₀ is the true value. This result has already
been used in demonstrating that 1 - α is the confidence level if all the tests
have level α (establishing (6.9) from (6.8)). For θ₁ ≠ θ₀, (6.10) says that the
probability of a Type II error at θ₁ for the test of the null hypothesis θ = θ₀
equals the probability, when θ₁ is the true value, that the confidence region will
include the false value θ₀. Thus, the ability of a confidence procedure to
exclude false values is the same as the power of the corresponding tests, in this
specific sense. This also leads to a correspondence between certain "optimum"
properties of tests and confidence procedures. (See Sect. 7.3.)
We consider next the size of a region, defined as its length (if it is an interval),
its area (if it is a region in the plane), and in general its k-dimensional volume
(if it is a region in k-dimensional space). It is perhaps more reasonable to be
concerned with the probability of including false values than the size of a
confidence region, since there is no merit in a small region if it is not even close
to the true value. At the same time, it is natural to feel that a good confidence
procedure would produce small regions. There turns out to be a direct
connection between the size of the confidence region and the probability of
including false values which implies that making either one small will tend to
make the other small.
Since a confidence procedure does not ordinarily give a region of fixed size,
let us consider the expected size of the confidence region. Consider, for instance,
a two-sided confidence interval [θ_l, θ_u] for a one-dimensional real parameter
θ. The size is the length θ_u - θ_l, and the expected length is E_{θ₁}[θ_u(S) - θ_l(S)]
if θ₁ is the true value. By Problem 41 we have

E_{θ₁}[θ_u(S) - θ_l(S)] = ∫ P_{θ₁}[θ_l(S) ≤ θ ≤ θ_u(S)] dθ  (6.11)

and the integral is unchanged if the integration is over all θ except the true
value. In other words, the expected length is the integral of the probability of
including false values. This is true of size generally (Problem 42). The
essential condition is that the expected size and the probability of inclusion
must be computed under the same distribution.
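Relation (6.11) can be checked numerically for the binomial intervals of Table 6.1. The sketch below (our own code; the limits are taken from Table 6.1 and the Riemann-sum integration is an assumption of convenience) compares the expected length under p₁ = 0.4 with the integral of the coverage probability computed under the same p₁.

```python
# Sketch checking (6.11) for the Table 6.1 intervals: the expected length under
# p1 equals the integral over theta of the probability of covering theta, both
# computed under the same p1.
from math import comb

p_l = [0.000, 0.021, 0.112, 0.247, 0.416, 0.631]
p_u = [0.369, 0.584, 0.753, 0.888, 0.979, 1.000]
n, p1 = 5, 0.4
pmf = [comb(n, s) * p1**s * (1 - p1)**(n - s) for s in range(n + 1)]

expected_length = sum(pmf[s] * (p_u[s] - p_l[s]) for s in range(n + 1))

m = 200000                        # Riemann sum over theta in (0, 1)
coverage_integral = sum(
    sum(pmf[s] for s in range(n + 1) if p_l[s] <= (i + 0.5) / m <= p_u[s])
    for i in range(m)
) / m

print(expected_length, coverage_integral)   # the two numbers agree
```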
Similarly, suppose we are concerned only with the upper confidence bound θ_u.

Then we do not mind including false values smaller than the true value θ₁,
but prefer to exclude those greater than θ₁. That is, it does not matter if θ_u > θ
for θ < θ₁, but we would like the probability that θ_u > θ to be small for θ > θ₁.
We would also like to overestimate θ₁ by as little as possible. In this case the
role of size is played by the "excess," defined as θ_u - θ₁ for θ_u > θ₁ and 0 for
θ_u ≤ θ₁. In particular, the expected "excess" equals the integral over all θ > θ₁
of the probability that θ_u exceeds θ. When this probability is small, the expec-
ted "excess" is small, and conversely.
The foregoing statements imply that when confidence regions correspond
to powerful tests, the confidence regions will have, first, high probability of
excluding false values and, second, small expected size. The exact relationship
between the properties of confidence regions and the relative emphasis placed
on various alternatives in seeking powerful tests is subtle. It is common
practice to choose tests which have desirable power functions without reference
to confidence properties, and to use the corresponding confidence regions
without investigating the confidence properties. The remarks above suggest
that this practice will provide good confidence regions.

*6.6 Randomized Confidence Regions

In principle, application of the theory of randomized tests to confidence


regions presents no difficulty. Given the observations, the probability that any
point belongs to the confidence region equals the probability that the ran-
domized test for the corresponding null hypothesis "accepts" it. For example,
a lower-tailed randomized test at exact level 6/64 in a binomial problem with 6
observations "accepts" the null hypothesis p = 0.5 with probability 1/6 if one
success occurs and always if more than one success occurs (see Sect. 5.1). The
corresponding probability is easily computed for all values of p (see Problem
44). The upper confidence bound could then be reported in the form of a
graph (for the particular outcome observed) showing the probability of
"acceptance" for various values of p, according to a randomized test. This
graph is monotonically decreasing (Problem 45). (A nonrandomized con-
servative upper confidence bound would be the point at which the graph
reaches 0. The point at which the graph crosses 0.5 is the upper confidence
bound corresponding to the nonrandomized test with exact level nearest 6/64.)
To carry out the randomization, one could choose U uniformly distributed
between 0 and 1 and take as the upper confidence bound the point at which
the graph reaches the value U. For one-sided binomial tests at the same exact
level for all p, this is always possible because the graph is monotonic. For two-
sided randomized tests with maximum average power, it is not possible to
choose the corresponding confidence region so that it is always an interval.
(See also Pratt [1961].) In any case, the defining property of a confidence
region could be satisfied by carrying out a separate randomization for each p,
but the resulting region would be literally an unimaginable hodgepodge.

In practice, randomized confidence procedures are not used. Actually


doing the randomization is, if anything, intuitively even less desirable here
than in the context of hypothesis testing, and reporting what would happen if
randomization were done is also more complicated. *

7 Properties of One-Tailed Procedures

7.1 Uniformly Most Powerful One-Tailed Tests

In Sect. 3 we discussed some optimum properties that point estimators might


have in general, and do have in the binomial problem in particular. In
developing tests of hypotheses for the binomial problem, we have so far relied
mainly on intuition. Are there optimum properties which justify use of the
one-tailed binomial test in a one-sided binomial problem? It does not go
without proof, and indeed is not true in all problems, that a one-tailed test or
procedure should necessarily be chosen when the null hypothesis is one-sided,
or when the alternative is one-sided, or even when both are one-sided. It is true,
however, in a wide class of problems arising in practice, in particular in all
binomial problems, as we shall now show. (See Karlin and Rubin [1956] or
Lehmann [1959] for a general theorem.)
The most important justification of one-tailed tests is a property of their
power, which will be illustrated first in a special case. Suppose S is the number
of successes in 10 independent Bernoulli trials, and we want a test of the null
hypothesis p ≥ 0.6 against the alternative p = 0.5 at the level 0.055. It will be
shown below that, among tests of this null hypothesis at this level, the test
which is most powerful against this simple alternative is the one which rejects
the null hypothesis for S ≤ 3 and "accepts" it otherwise. The same test is
most powerful against the alternative that p = p₁ for any p₁ < 0.6. We
summarize these facts by saying that the test is "uniformly most powerful"
against the alternative p < 0.6. This means that, if the probability of rejection
for any other test never exceeds 0.055 for p ≥ 0.6, then it never exceeds the
probability of rejection for this test (as in Fig. 4.1) for p < 0.6. Equivalently, if
some other test has greater probability of rejection than this test for some
p < 0.6, then it has probability of rejection greater than 0.055 for some
p ≥ 0.6.
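The exact level and the power of this test are binomial tail probabilities, which the following sketch (our own code) evaluates for the example just described.

```python
# Sketch for the example above: the exact level of the test rejecting for
# S <= 3 when n = 10, p0 = 0.6, and its power at a few alternatives p1 < 0.6.
from math import comb

def cdf(s, n, p):
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(s + 1))

n = 10
print(round(cdf(3, n, 0.6), 4))           # exact level, about 0.0548 (<= 0.055)
for p1 in (0.5, 0.4, 0.3, 0.2):
    print(p1, round(cdf(3, n, p1), 3))    # power increases as p1 decreases
```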
In general, a test is the uniformly most powerful test against a certain alter-
native hypothesis if it simultaneously maximizes the power against every
alternative distribution. The tests and alternatives considered may be
restricted in some way. Thus a test is uniformly most powerful among tests of
a specified class against a specified set of alternatives if no other test of the
specified class has larger power against any alternative of the specified set. If
the class of tests is not specified, then it is understood to be all tests of the same
null hypothesis at the same level as the test in question (and based on the same

observations, of course). A test may be uniformly most powerful against one


alternative hypothesis but not against another. The test discussed in the
previous paragraph, for example, is not uniformly most powerful against the
alternative that S is not binomially distributed.
Now suppose that S is the number of successes in n Bernoulli trials with
probability p of success on each trial. As defined in Sect. 5, a lower-tailed
randomized test rejects always if S < s_l and with probability φ_l if S = s_l; it
includes the lower-tailed nonrandomized tests (φ_l = 0 or 1). The exact level of
such a test for the simple null hypothesis p = p₀ is P_{p₀}(S < s_l) + φ_l P_{p₀}(S =
s_l). A lower-tailed binomial test of the simple null hypothesis p = p₀ or the
composite null hypothesis p ≥ p₀ is uniformly most powerful at its exact level
against alternatives below p₀. The result for p = p₀ is stated here as Theorem
7.1; it will be proved in Sect. 7.4.

Theorem 7.1. A lower-tailed binomial test at exact level α of the simple null
hypothesis H₀: p = p₀ has maximum power against any alternative p = p₁ with
p₁ < p₀, among all level α tests of H₀.

The property in Theorem 7.1 also holds for the broader null hypothesis
p ≥ p₀ because the test has the same exact level for p ≥ p₀, while any other
test at level α for p ≥ p₀ has level α for p = p₀, and therefore, by Theorem 7.1,
cannot be more powerful against any alternative p₁ < p₀.
Of course, if a lower-tailed test has level α for the null hypothesis p ≥ p₀,
but its exact level is α₀ where α₀ < α, then there will be other tests at level α
which are more powerful; however, they all must have exact level greater than
α₀. As a matter of fact, slightly more is true. Any test whose power is greater
than a lower-tailed test against any alternative p₁ < p₀ also has greater
probability of a Type I error under every null distribution p ≥ p₀.
Similar results hold for upper-tailed tests, and indeed follow simply by
interchanging "success" and "failure."

*7.2 Admissibility and Completeness of One-Tailed Tests

The results of Sect. 7.1 imply that any lower-tailed test for the null hypothesis
p ≥ p₀ is admissible, meaning that any test with smaller probability of error
under some distribution has greater probability of error under some other
distribution. (The errors may be Type I or Type II, as the distributions may be
null or alternative.) Briefly, any test which is better somewhere is worse
somewhere else. If a test is not admissible, then it can be eliminated from
consideration because there is another test at least as good everywhere, and
better somewhere. This other test has power at least as great against every
alternative and probability of a Type I error at least as small under every null
distribution, and it either has greater power against some alternative, or else

has smaller probability of a Type I error under some null distribution.⁵ On


the other hand, an admissible test cannot be excluded from consideration
without adducing some further criterion, because any other test is either
equivalent everywhere or inferior somewhere.
A further property of the lower-tailed tests is that they form a complete
class of tests for the null hypothesis p ≥ p₀. This means that, given any test,
there is a lower-tailed test at least as good everywhere. Specifically, this is the
lower-tailed test with the same probability of rejection under p₀. (The proof
is given in Sect. 7.4.)
The definitions just given agree with the usual definitions of "admissible"
and "complete" in decision theory.⁶ Slightly different definitions are some-
times used in the theory of hypothesis testing. A test is sometimes called
admissible if any test with larger power against some alternative has either
smaller power against some other alternative or larger exact level. A class of
tests is sometimes called complete if, given any test, there is a test in the class
with exact level as small and power as large against every alternative. These
alternative definitions are in accord with the emphasis on only the maximum
of the probability of a Type I error in testing theory. They are not really in
accord with common sense, however, and therefore will not be used here. It is
possible for two tests, say A and B, to have the same level and the same
probability of error everywhere except that test A has smaller probability of
a Type I error than test B under some null distributions. Then A is better than
B in common sense. By the alternative definitions, however, B might be
admissible and might be that member of a complete class excluding A which
was supposed to be as good as A.*

7.3 Confidence Procedures

The properties of a confidence procedure are analogous to the properties of


the corresponding family of tests. Specifically, corresponding to the usual
family of nonrandomized, lower-tailed binomial tests at conservative level α
is the usual nonrandomized, upper confidence bound. Let 1 - α(p) be the true
confidence level of this procedure, the probability that the bound is at least p
when p is the true value. For each p₀, among upper confidence bounds whose

⁵ It does not immediately follow that all tests which are not admissible can be eliminated
simultaneously, because one could imagine an infinite sequence of tests, each better than the one
before but with no test better than all of them. This does not occur in practical testing problems,
however. See, for instance, Lehmann [1959, Appendix 4].
⁶ In restricting consideration to sufficient statistics, we have already adopted the view that when
several procedures have the same probability of rejection for all p, they are equivalent, and only
one of them need be considered. Our definition of "complete" reflects this view. Often "com-
plete" is defined to require excluded procedures to be strictly inferior; what we call "complete"
here is then called "essentially complete."

true level at p₀ is at least 1 - α(p₀), this bound simultaneously maximizes the
probability of falling below p₀ for all true values of p below p₀. Among bounds
whose true level is at least as great everywhere, it simultaneously minimizes
the expected "excess" over the true p whatever it may be (Sect. 6.5). (These
properties do not hold if the true level is merely required to be at least 1 - α,
rather than 1 - α(p). Problem 46 gives an example of a level 1 - α, non-
randomized upper confidence bound which is distinctly unreasonable but,
under a particular p, has no greater probability of including any p₀ > p than
the usual bound, has smaller probability for some p₀, and has smaller expected
"excess.")
Next consider the randomized, upper confidence bound corresponding to
the family of randomized, lower-tailed binomial tests at exact level α (Sect. 6.6).
Among upper confidence bounds at level 1 - α, whatever the true p may be,
this bound simultaneously maximizes the probability of falling below p₀ for
all p₀ > p and minimizes the expected "excess" over p.
Finally, consider any confidence procedure corresponding to a family of
one-tailed binomial tests. A property like that given above for the usual
nonrandomized bound holds. It must, however, be suitably restated if the
confidence region is too complicated to be given by a confidence bound.
Whether it is or not depends on how the exact level α(p) varies with p. A
confidence bound is obtained for the family of nonrandomized tests chosen to
have exact level as close as possible to α (Problem 47) as well as for the two
cases mentioned above.

*7.4 Proofs

Let α(p, φ) be the probability that the null hypothesis p = p₀ will be rejected
by the test φ when p is the true value. Given any p₀, 0 < p₀ < 1, and any
α₀, 0 ≤ α₀ < 1, there is one and only one lower-tailed randomized test φ*
based on S for which α(p₀, φ*) = α₀. (This is easy to see by considering what
happens as the lower tail is gradually augmented.) We will now prove that this
test φ* uniformly maximizes the probability of rejection α(p₁, φ) for p₁ < p₀
and uniformly minimizes α(p₁, φ) for p₁ > p₀, among tests φ for which
α(p₀, φ) = α₀. It will be left to the reader (Problems 48 and 49) to verify that
all statements of Sects. 7.1-7.3 follow (with the help of hints given there and the
relation between tests and confidence procedures, particularly the discussion
of including false values in Sect. 6.5).
Let φ be any (randomized) test. As in Sect. 5, denote by φ(s) the prob-
ability that φ rejects the null hypothesis when a particular value s is observed.
Then

α(p, φ) = Σ_s φ(s) P_p(S = s) = Σ_s φ(s) C(n, s) p^s (1 - p)^(n-s).  (7.1)

For p₁ < p₀ (H₀ false), we want to maximize the expression in (7.1), subject to
the condition that

α(p₀, φ) = Σ_s φ(s) P_{p₀}(S = s) = α₀.  (7.2)

It is intuitively clear that we should choose φ(s) = 1 for those s where the
contribution to the sum (7.1) is greatest compared to the contribution to the
sum (7.2). In other words, the maximizing φ will be of the form

φ₀(s) = 1 if λ(s) > k, and φ₀(s) = 0 if λ(s) < k,  (7.3)

where λ is the likelihood ratio given by

λ(s) = P_{p₁}(S = s) / P_{p₀}(S = s).

The value of k and the value of φ₀(s) when the likelihood ratio equals k are
chosen so that α(p₀, φ₀) = α₀. Randomization will occur (that is, 0 < φ₀(s) < 1),
if at all, only where λ = k.
Before proving this fact, we will first verify that (7.3) leads to a lower-tailed
test. The likelihood ratio is

λ(s) = [C(n, s) p₁^s (1 - p₁)^(n-s)] / [C(n, s) p₀^s (1 - p₀)^(n-s)]
     = ((1 - p₁)/(1 - p₀))^n ((p₁/p₀) · ((1 - p₀)/(1 - p₁)))^s.  (7.4)

The values of s for which (7.4) is larger than a constant k are simply those
values of s which are less than a constant s_l, because (p₁/p₀)(1 - p₀)/(1 - p₁)
< 1 if p₁ < p₀. Thus φ₀ is indeed a lower-tailed test. The crucial property
leading to this result is that the likelihood ratio is a monotone function of s.
If p₁ > p₀, one can prove that a lower-tailed test minimizes α(p₁, φ)
subject to α(p₀, φ) = α₀ by a similar argument or by suitably using the result
just proved.
The fact that (7.3) maximizes (7.1) subject to (7.2) is a special case of the
following well-known theorem. Since the result does not depend on the bi-
nomial distribution, the theorem is stated for any two distributions P₀ and P₁
of any random variable S. E₀ and E₁ denote the expected value operations
corresponding to the distributions P₀ and P₁.

Theorem 7.2 (Neyman-Pearson Fundamental Lemma). Let P₀ and P₁ be
distributions of a random variable S. Then E₁[φ(S)] is maximized, subject to
the conditions

(i) E₀[φ(S)] ≤ α₀

and

(ii) 0 ≤ φ(s) ≤ 1 for all s,

by any function φ of the form

(iii) φ(s) = 1 if the likelihood ratio > k, and φ(s) = 0 if the likelihood ratio < k,

provided the value of k and the values of φ(s) at those s where the likelihood
ratio equals k are such that E₀[φ(S)] = α₀. Conversely, if φ maximizes
E₁[φ(S)] subject to conditions (i) and (ii), then there is a constant k for which
(iii) holds, and furthermore, E₀[φ(S)] = α₀ unless there is a set A such that
P₁(S ∈ A) = 1 and P₀(S ∈ A) < α₀.
If P₀ and P₁ are discrete, then the likelihood ratio is the ratio of the discrete
frequency functions:

(iv) likelihood ratio = P₁(S = s)/P₀(S = s).

If P₀ and P₁ have densities f₀ and f₁ respectively,⁷ then

(v) likelihood ratio = f₁(s)/f₀(s).

For the converse, (iii) is considered to hold if the set of s for which it fails has
probability 0 under both P₀ and P₁.
PROOF (For more detail, see, e.g., Lehmann [1959]). Consideration of what
happens as k increases from 0 to ∞ makes it clear that there exists a φ₀ of the
form (iii) such that E₀[φ₀(S)] = α₀; that is, there is a k such that E₀[φ₀(S)]
= α₀ if φ₀(s) satisfies (iii) and is defined suitably when the likelihood ratio
equals k. Let φ₀ be so defined and suppose that φ satisfies (i) and (ii). Then,
in the discrete case, we shall show that

E₁[φ₀(S)] - E₁[φ(S)] = Σ_s [φ₀(s) - φ(s)] P₁(S = s)
  ≥ Σ_s [φ₀(s) - φ(s)] k P₀(S = s)
  = k{E₀[φ₀(S)] - E₀[φ(S)]}
  ≥ kα₀ - kα₀ = 0.  (7.5)

The first inequality holds term by term because φ₀(s) = 1 ≥ φ(s) where
P₁(S = s) > kP₀(S = s) and φ₀(s) = 0 ≤ φ(s) where P₁(S = s) <
kP₀(S = s). The inequality is strict unless φ also satisfies (iii). The second
inequality holds because E₀[φ₀(S)] = α₀ ≥ E₀[φ(S)] and k ≥ 0. It is strict
unless k = 0 or E₀[φ(S)] = α₀. It follows that E₁[φ(S)] ≤ E₁[φ₀(S)] for any
φ satisfying (i) and (ii), thus proving the direct half of the theorem. If φ also
maximizes E₁[φ(S)], then neither inequality is strict. It follows that φ satisfies

⁷ Actually, this case covers any two distributions if densities with respect to an arbitrary measure
are allowed. In fact, we may take f₀ = dP₀/d(P₀ + P₁), f₁ = dP₁/d(P₀ + P₁).



(iii), and that E₀[φ(S)] = α₀, except perhaps when k = 0. Thus the converse
half of the theorem is proved except when k = 0, φ satisfies (iii), and E₀[φ(S)]
< α₀. In this case, the likelihood ratio is greater than k on the set A of all
s where P₁(S = s) > 0; then P₁(S ∈ A) = 1, and φ(s) = 1 for s ∈ A, so that
P₀(S ∈ A) ≤ E₀[φ(S)] < α₀, satisfying the last clause of the theorem. □

The proof for densities is similar and is left to the reader.*
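For discrete distributions, the test of form (iii) can be constructed directly: order the sample points by likelihood ratio and fill the rejection region until the level α₀ is exhausted, randomizing at the boundary. The sketch below is our own code (the function name and dictionary representation are assumptions of convenience), shown here only to illustrate the construction.

```python
# Sketch of the construction in Theorem 7.2 for discrete distributions.
def neyman_pearson_test(P0, P1, alpha0):
    """P0, P1: dicts mapping sample points to probabilities. Returns phi."""
    lr = lambda s: float('inf') if P0[s] == 0 else P1[s] / P0[s]
    phi = {s: 0.0 for s in P0}
    remaining = alpha0
    for s in sorted(P0, key=lr, reverse=True):    # largest likelihood ratio first
        if P0[s] <= remaining:
            phi[s] = 1.0                          # reject outright
            remaining -= P0[s]
        else:
            phi[s] = remaining / P0[s]            # randomize at the boundary
            remaining = 0.0
    return phi

# Binomial example: n = 6, p0 = 0.5 versus p1 = 0.3, alpha0 = 6/64.
from math import comb
n, p0, p1 = 6, 0.5, 0.3
P0 = {s: comb(n, s) * p0**s * (1 - p0)**(n - s) for s in range(n + 1)}
P1 = {s: comb(n, s) * p1**s * (1 - p1)**(n - s) for s in range(n + 1)}
print(neyman_pearson_test(P0, P1, 6 / 64))   # phi(0) = 1, phi(1) = 5/6, else 0
```

As the comment indicates, the resulting test is exactly the lower-tailed randomized test of Sect. 5.1, in agreement with the monotone likelihood ratio argument above.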

8 Choice of Two-Tailed Procedures and Their


Properties

8.1 Test Procedures

In the last section we found that in the one-sided binomial problem, a one-
tailed randomized test based on S is uniformly most powerful against
alternatives in the appropriate direction. For two-sided alternatives it is
natural to use two-tailed tests, and later in this section, we shall show that no
others should be considered. No test is uniformly most powerful against a
two-sided alternative, however, since different one-tailed tests are most
powerful against alternatives on the two sides. Hence, even if the level α is
given, we need some further criterion to select among all two-tailed tests at
level α.
One possible criterion for choice is called the equal-tails criterion. The
idea is that a two-tailed test at level α should be a combination of two one-
tailed tests, each at level α/2. To make the exact levels equal, in most discrete
problems, one of the one-tailed tests must ordinarily be randomized. (If the
null distribution is symmetric, either both or neither must be randomized.)
The usual nonrandomized two-tailed binomial test (Clopper and Pearson
[1934]) discussed earlier in this book has equal nominal levels in the two
tails. Specifically, it rejects at nominal level α when either one-tailed, non-
randomized test at conservative level α/2 would reject, and "accepts"
otherwise.
The usual two-tailed binomial test is therefore conservative, as was
illustrated in the context of confidence intervals by Fig. 6.3. In fact, it might be
considered ultraconservative, since it would sometimes be possible to add a
point to the rejection region without making the exact level greater than the
nominal level. For the null distribution given in Table 4.1, for example, at
level α = 0.06, the usual test of H₀: p = 0.6 would reject only for S = 0, 1, 2,
or 10. Either S = 3 or S = 9 could be added to the rejection region without
raising the level above 0.06. The disadvantage of including such a point is that
the level of one of the one-tailed tests would then exceed α/2 so that the one-
tailed and two-tailed procedures would not be simply related. Still, under the

usual two-conclusion interpretation of tests, taken literally, or the first three-
conclusion interpretation (Table 4.2), it seems unarguable that the point
should be added when possible.
conclusion interpretation (Table 4.2), it seems unarguable that the point
should be added when possible.
The primary advantage of using the criterion of equal tails is that it provides
two-tailed tests which are simply related to one-tailed tests, and the corres-
ponding confidence intervals have as endpoints the respective upper and
lower confidence bounds, each at one-sided level α/2 (see Sect. 8.2). It is
difficult to internalize the inference if equal tails are not used. Nevertheless, in
discrete problems, it is difficult to argue against eliminating at least ultra-
conservatism if we consider the situation from a truly two-sided point of view,
rather than as a combination of two one-sided problems. Perhaps, however, a
truly two-sided view is rare. If, for instance, we adopt the second three-con-
clusion interpretation of tests (Table 4.3), the situation is different. Now to
achieve a desired error probability α, the two tail probabilities α₁ and α₂ both
must be less than or equal to the same value, although that value is α, not α/2.
Another criterion which might be used to select among possible two-tailed
tests is "unbiasedness." A (two-conclusion) test is called unbiased if the prob-
ability of rejecting the nuB hypothesis when it is false is at least as great as
when it is true. In the binomial problem, for a specified nuB hypothesis p = Po
and exact level (I., there is one and only one unbiased two-tailed test (Problem
50). As we will show later in this section, this test is also uniformly most power-
ful unbiased, that is, uniformly most powerful among all unbiased tests at
level (I.. Ordinarily, however, it is not equal-tailed; it is randomized at both the
lower and upper critical values; and even adjusting (I. cannot eliminate
randomization in both tails (Problem 50d).
If a nonrandomized test is desired, it will thus generally be impossible to
find one which is unbiased, let alone uniformly most powerful unbiased. Then
one would presumably choose the nonrandomized test which is in some
sense most nearly unbiased. This sometimes gives results different from the
usual equal-tailed, nonrandomized test. Thus, with or without randomization,
the criterion of unbiasedness has the disadvantage of leading to two-tailed
tests which are not simply related to one-tailed tests, as equal-tailed tests are.
While it is pleasant to find that a test chosen on other grounds is unbiased,
unbiasedness is not really a satisfactory theoretical criterion for choosing
among procedures anyway. For instance, a biased test might be considerably
more powerful than an unbiased test except in a very small interval on one
side of the null hypothesis, as in Fig. 8.1. In such a case, the biased test may be
preferable to the unbiased one. To be sure, the criterion of equal tails is
subject to the same criticism. For example, the unbiased test in Fig. 8.1 might
be equal-tailed, while the presumably better test is not. Situations like Fig. 8.1
may be rare in practice, but the possibility makes clear that our fundamental
preferences among power curves are not captured by either unbiasedness or
equal tails. This criticism is less telling against equal tails than against
unbiasedness, however, since the advantage claimed earlier for equal tails
has no relation to power.

[Figure 8.1: Two hypothetical power functions (power plotted against the parameter).]

Another possible criterion for choosing critical regions is minimum like-
lihood, discussed in Sect. 4.5 in relation to P-values. By this procedure, the
rejection region would be made up of those sample points which have the
smallest probability under the null hypothesis, starting with the least probable
and working up until the total probability in both tails combined is as large
as possible without exceeding α.
To illustrate, we consider again the binomial problem with n = 10, H₀:
p = 0.6, α = 0.10; the point probabilities are given in Table 4.1. In order of
increasing probability, the first five points entering the rejection region are
S = 0, 1, 10, 2, and 9, and their total null probability is 0.000 + 0.002 +
0.006 + 0.010 + 0.040 = 0.058. Adding the next point, S = 3, to the rejection
region increases the rejection probability to 0.058 + 0.043 = 0.101, which
exceeds 0.100, so that S = 3 cannot be included for a test at level 0.10. By
the minimum likelihood criterion then, the conservative, nonrandomized,
two-tailed test of the null hypothesis p = 0.6 at the 0.10 level rejects when
S = 0,1,2,9, or 10, and "accepts" otherwise. The test coincides with the usual
equal-tailed test in this instance. At level 0.06, however, the two methods
differ. The equal-tailed test rejects only for S = 0, 1,2, or 10, while the minimum
likelihood test still rejects for S = 0, 1,2,9, or 10. The randomized minimum
likelihood procedure at exact level 0.10 rejects when S = 0, 1, 2, 9, or 10 and
with probability 0.042/0.043 when S = 3, and differs from the randomized,
equal-tailed test at the same exact level 0.10.
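
The construction just described is easy to mechanize; the sketch below orders the sample points by null probability and accumulates them until the level would be exceeded. It assumes SciPy; the function name is ours and is illustrative only.

```python
# A minimal sketch of the nonrandomized minimum likelihood rejection region:
# add sample points in order of increasing null probability until adding the
# next point would push the total above alpha.  Assumes SciPy.
from scipy.stats import binom

def min_likelihood_region(n, p0, alpha):
    pmf = {s: binom.pmf(s, n, p0) for s in range(n + 1)}
    region, total = [], 0.0
    for s in sorted(pmf, key=pmf.get):          # least probable point first
        if total + pmf[s] > alpha:
            break
        region.append(s)
        total += pmf[s]
    return sorted(region), total

print(min_likelihood_region(10, 0.6, 0.10))     # ([0, 1, 2, 9, 10], ~0.059)
print(min_likelihood_region(10, 0.6, 0.06))     # same region at level 0.06
```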
In any situation, a minimum likelihood test at exact level α which is based
on a sufficient statistic S having a unique null distribution is most powerful
against the alternative that all values of S are equally likely, among all tests at
level α, randomized or not (Problem 52). In the binomial case, although there
is no value of p under which all values of S are equally likely, they would be if p
were a random variable having the uniform distribution on (0, 1) (Problem
54). Thus the minimum likelihood test using the binomial distribution is most
powerful against the alternative that p is uniformly distributed between 0 and
1. An equivalent property, which does not depend on regarding p as a random
variable, is that the minimum likelihood procedure in the binomial case
maximizes the average power, that is, the integral of the power over p,

∫₀¹ P_p(p₀ rejected) dp. Equivalently, it maximizes the area under the power
curve from 0 to 1. The power is not defined at the null hypothesis p = p₀ but
the integral or area would not be changed by omitting this single value. Like
the unbiased test, the test based on the minimum likelihood criterion has the
disadvantage that it is not simply related to one-sided procedures. The
advantage just described, however, precludes the conceivable sort of dis-
advantage of unbiasedness represented in Fig. 8.1. Average power is closer to
our real concern than unbiasedness. Although reparameterization changes it,
this corresponds to using a weighted average. In the Bayesian framework, the
appropriate weights reflect the prior distribution and cost/benefit considera-
tions or the" loss function."
In performing two-tailed tests, either in general or in binomial situations,
statistical methodology does not clearly prescribe a particular procedure as
optimum, whether it is randomized or nonrandomized. If one attempts to
escape this problem by reporting two-tailed P-values, the difficulties discussed
in Sect. 4.5 apply.

8.2 Confidence Procedures

If confidence regions are to be obtained from two-tailed tests, some procedure


must be adopted to choose upper and lower critical values for any null
hypothesis p = p₀, since an escape into P-values is not available. Most tables
of confidence limits are constructed, following Clopper and Pearson [1934],
to correspond to the usual equal-tailed, nonrandomized test of p = p₀. Then
the upper and lower ends of the confidence region are the one-sided confidence
bounds, each at confidence level 1 − α/2, and for each p₀ the one-sided and
two-sided confidence procedures are simply related.
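
For readers who wish to compute such equal-tailed limits directly rather than read them from tables, the standard beta-quantile form of the binomial tail probabilities can be used; the sketch below assumes SciPy, and the example values are illustrative only.

```python
# A minimal sketch of the equal-tailed (Clopper-Pearson) confidence limits,
# using the standard beta-quantile representation of binomial tail
# probabilities.  Assumes SciPy; the example values are illustrative only.
from scipy.stats import beta

def clopper_pearson(s, n, alpha):
    lower = 0.0 if s == 0 else beta.ppf(alpha / 2, s, n - s + 1)
    upper = 1.0 if s == n else beta.ppf(1 - alpha / 2, s + 1, n - s)
    return lower, upper

# 90% interval (alpha = 0.10) for s = 3 successes in n = 10 trials
print(clopper_pearson(3, 10, 0.10))
```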
Since the usual test is conservative, the corresponding confidence pro-
cedure is also conservative, as is illustrated in Sect. 6.4, and in fact even
ultraconservative as explained in relation to testing in Sect. 8.1. However,
when the advantages of the criterion of equal tails are sacrificed, it may be
even more difficult to internalize the inference with confidence regions than
with tests. Unequal tails can be quite misleading, especially if confidence is
interpreted as if the Bayesian concept of subjective probability applied, that is,
as the probability, after the observations are made, that the true value is in
the confidence region (as explained in Sect. 6.3).
The confidence procedure corresponding to a family of tests, each of which
is unbiased at exact level α, has the property that the probability of including
the true value is at least 1 − α for any true value, while the probability of
including any particular false value is at most 1 − α for any true value. Such a
confidence procedure will be called unbiased at confidence level 1 − α.⁸

8 If the tests are unbiased but have different exact levels for different members of the family, the
corresponding confidence property is more complicated and will not be discussed here. (Problem
53 requests a statement of this property.)

Unbiasedness seems an even less important property of a confidence interval


than of a test, and equal tails even more important.
The confidence regions which correspond to the minimum likelihood
procedure, strangely enough, are not necessarily intervals. When only
nonrandomized procedures are to be considered, however, it is apparently
possible to modify the procedure so as to avoid this difficulty while preserving
the property, discussed above, of maximizing the average power. For a non-
randomized test, we have (Problem 55)

average power = ∫₀¹ P_p(p₀ rejected) dp
              = (number of values of S such that p₀ is rejected)/(n + 1).    (8.1)
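
Formula (8.1) is easy to check numerically for any particular rejection region; the sketch below does so for the minimum likelihood region found above for n = 10 and p₀ = 0.6. It assumes SciPy is available.

```python
# A numerical check of (8.1): the integral of the power over p equals
# (number of rejection points)/(n + 1).  Assumes SciPy; the region is the
# minimum likelihood region found above for n = 10, H0: p = 0.6.
from scipy.stats import binom
from scipy.integrate import quad

n, region = 10, [0, 1, 2, 9, 10]
power = lambda p: sum(binom.pmf(s, n, p) for s in region)
print(quad(power, 0, 1)[0], len(region) / (n + 1))    # both are about 0.4545
```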

This may be maximized by more than one level α test. Crow [1956] took
advantage of this flexibility to modify the minimum likelihood procedure so
that the corresponding confidence regions are intervals (for n ≤ 30 and
α = 0.10, 0.05, or 0.01). Crow made further modifications to shorten the
intervals for S near 0 or n and to make the interval for a given S at level 0.01
contain that at level 0.05 and the interval at level 0.05 contain that at level 0.10.
Crow's confidence procedure, and any other confidence procedure cor-
responding to nonrandomized level α tests which maximize the number of
values of S in each rejection region, have the following properties among
nonrandomized procedures at level α. First, they minimize the average pro-
bability of including each point p₀, ∫₀¹ P_p(p₀ included) dp, which can also be
interpreted as the average probability of including the false value p₀. (In the
integrand, if p is the true value and p ≠ p₀, then p₀ is a false value; omitting
p = p₀ from the region of integration has no effect.) Second, they also
minimize both the average expected length of the confidence region and the
ordinary, arithmetic average of the lengths of the n + 1 possible confidence
regions, these being in fact equal (Problem 56). Specifically, for intervals these
averages are

∫₀¹ E_p[p_U(S) − p_L(S)] dp = (1/(n + 1)) Σ_{s=0}^{n} [p_U(s) − p_L(s)].    (8.2)

If, for some s, the confidence region C(s) is not an interval, its generalized
length (Borel measure) ∫_{C(s)} dp must be used here in place of the length
p_U(s) − p_L(s). (Since "p₀ included" is equivalent to "p₀ not rejected," the first
property follows from (8.1) and the second from the relation

Σ_{s=0}^{n} ∫_{C(s)} dp = ∫₀¹ [number of values of s such that p ∈ C(s)] dp,    (8.3)

whose proof is Problem 58.)



The solution of the corresponding problem for randomized procedures


cannot be chosen so that the confidence region is always an interval, because
there is flexibility in the choice of tests only at isolated values of Po [Pratt,
1961].
By a technique similar to that above, it is possible quite generally to
minimize the average expected length of a confidence interval using any
weight function in averaging. The essential step is to multiply (6.11) by the
weight function and integrate with respect to θ₁.

*8.3 Completeness and Admissibility of Two-Conclusion Two-Tailed Tests

So far we have assumed, on an intuitive basis, that a two-sided binomial


problem calls for a two-tailed test, rather than some more complicated type of
test. This can be justified in terms of the concepts of completeness and ad-
missibility introduced earlier in Sect. 7.2.
Consider first the two-conclusion interpretation in a two-sided binomial
problem with null hypothesis p = Po. It can be shown that the two-tailed tests
form a complete class and somewhat more: Given any procedure which is not
a two-tailed test, there is a two-tailed test with the same exact level and greater
power everywhere, that is, the same probability of rejection when p = p₀
and greater probability of rejection for any p ≠ p₀.⁹ This justifies restricting
consideration to two-tailed tests. Furthermore, none of these may be excluded
from consideration without introducing some further criterion, because all
two-tailed tests are admissible: Given any two-tailed test, no other test has
power as great for every p ≠ p₀ and at the same time exact level as small.
An interesting generalization also holds. The null hypothesis p = p₀ is
sometimes used as an approximation to the null hypothesis that p lies in a
specified interval containing p₀. That is, we might prefer to "accept" for p
inside some interval, and to reject for p outside it. At the endpoints we may
prefer "acceptance" or rejection, or we may be indifferent. (We assume that
the endpoints of the interval are different and also that neither endpoint is
0 or 1. The degenerate case where they are equal is the situation discussed in
the previous paragraph.) The previous result, that the two-tailed tests are
admissible and form a complete class, holds here also. Explicitly, given any
procedure which is not a two-tailed test, there is a two-tailed test which has
greater probability than the given test of "accepting" when p is inside the
interval and of rejecting when p is outside the interval (and the same prob-
ability as the given test of" accepting" and rejecting at the endpoints). Further,

9 The definition of completeness requires only "at least as great." The statement is true only if
one-tailed tests are included in the class of two-tailed tests, as we shall take them to be and as they
are by the formal definition of two-tailed tests given above. In the binomial problem, we can take
s_l = 0, φ_l = 0 or s_u = n, φ_u = 1.

given any two different two-tailed tests, each one has greater probability of
making the correct decision than the other at some values of p. Thus we need
consider only two-tailed tests, but none can be excluded from consideration
without adducing some further criterion.
The facts of the last two paragraphs have been proved by Karlin [1955] for
any strictly Pólya type 3 distribution. (The monotone likelihood ratio property
mentioned in Sect. 7.4 is Pólya type 2. Pólya type 3 is a generalization.)
Karlin's proof uses a fundamental theorem of game theory, that the class of all
Bayes' procedures and their limits is complete, under certain compactness
conditions. One could use instead the generalization of the Neyman-Pearson
fundamental lemma to two side conditions. The proof of this generalization
(see, for instance, Lehmann [1959]) is related to the usual proof of the funda-
mental theorem of game theory. With the help of completeness, a proof of
admissibility is analogous to that given in Sect. 7.4.*

*8.4 Completeness and Admissibility of Three-Conclusion Two-Tailed Tests

Results similar to those of Sect. 8.3 hold for the various three-conclusion inter-
pretations of two-sided problems. For definiteness (the discussion applies to
the others also with trivial modifications), we consider the conclusions as
(a) P < Po, (b) "accept" the null hypothesis (that is, draw no conclusion), and
(c) P > Po. Suppose that when in fact P < Po, we prefer (a) to (b) and (b) to (c);
when P = Po we prefer (b) to either (a) or (c); and when p > Po we prefer (c) to
(b) and (b) to (a). Then, given any procedure for reaching one of these con-
clusions, there is a two-tailed test which (in its three-conclusion interpretation)
is at least as good for every p. Specifically, if !Xl equals the probability when
p = Po of concluding p < Po and !Xl equals the probability when p = Po of
concluding P > Po by the given procedure, then the two-tailed test combining
the lower-tailed test at exact level !Xl and the upper-tailed test at exact level !Xl
is at least as good as the given procedure, whatever the value of p. When
P < Po, its probability of concluding P < Po is at least as large, and its prob-
ability of concluding p > Po is at least as small, as that of the original pro-
cedure. (This follows from the results given in Sect. 7.) When P > Po, the
same statement holds with the inequalities reversed; and when P = Po, the
two procedures have the same probability of leading to each conclusion.
Thus the two-tailed tests form a complete class in the three-conclusion
interpretation (with the natural preferences given above). Are all admissible in
this interpretation, or might there now be one which is at least as good as
another whatever the value of p and better for some p? The results given earlier
for one-tailed tests imply immediately that, given any two-tailed test inter-
preted as a three-conclusion procedure, any procedure having greater
probability under any p > p₀ of correctly concluding p > p₀ has also greater
probability under all p ≤ p₀ of incorrectly concluding p > p₀. This statement

also holds with the inequalities reversed. These results are not quite what we
would like, however, because we might prefer a procedure having somewhat
smaller probability, under p > p₀, of correctly concluding p > p₀, if it also
had a sufficiently smaller probability of incorrectly concluding p < p₀ (and
therefore, of course, larger probability of "accepting" the null hypothesis).
That is, a sufficient decrease in the probability of the least desirable conclusion
p < p₀ might more than offset a decrease in the probability of the most
desirable conclusion p > p₀.
In order to make the problem definite enough to investigate, let us suppose
that the undesirability of a procedure, when p is the true value, is measured by
weighting the probabilities of errors and adding. The weights may depend on
p and are denoted by L_j(p) for the conclusions j = a, b, and c. Specifically, then,
when p is the true value, the undesirability is the sum

L_a(p) P_p(test concludes p < p₀)
  + L_b(p) P_p(test "accepts")                                    (8.4)
  + L_c(p) P_p(test concludes p > p₀).

In accordance with the preferences expressed earlier, we have

L_a(p) < L_b(p) < L_c(p)    for p < p₀,                           (8.5)
L_b(p) < L_a(p) and L_b(p) < L_c(p)    for p = p₀,                (8.6)
L_c(p) < L_b(p) < L_a(p)    for p > p₀.                           (8.7)
Beyond this, the L_j may be chosen almost arbitrarily and the statement we are
about to make will still hold.
If undesirability is interpreted as the sum given in (8.4), and a mild further
restriction is satisfied by the L_j, it can be proved (Karlin and Rubin [1956];
Karlin [1957b]; Problem 59) that any two-tailed test whose upper and lower
critical values are not equal is admissible; that is, no other procedure is as
desirable under all p and more desirable under some p. (For the sample sizes
and null hypotheses that occur in practice, tests whose upper and lower
critical values are equal have large probability of rejecting the null hypothesis
when it is true and hence are never considered.)
These statements about the three-conclusion interpretation of two-sided
binomial problems also hold if the null hypothesis p = p₀ is replaced by the
null hypothesis that p lies in an interval containing p₀, as was done above for
the two-conclusion interpretation. *

9 Appendices to Chapter 1
Straightforward tabulation of the binomial distribution involves three
variables (n, p, and s) and leads to extremely bulky tables. (Table B is straight-
forward but very abbreviated and is not useful for in-between values, for large

n, or for finding confidence limits on p.) Straightforward machine computation


is often inefficiently lengthy, and is sometimes infeasible because of overflow,
underflow, or roundoff error. For these reasons, approximations are widely
used. Two common approximating distributions are the Poisson (which
requires a two-variable table) and the normal (which requires only a one-
variable table). Simple limiting processes by which the binomial distribution
leads to the Poisson and normal distributions will be developed in Appendix
A. The accuracy of approximations can be vastly improved, however, by
various kinds of transformations and adjustments. The appropriate computing
forms of the resulting approximations may obscure their origins. The ap-
proximations suggested at the ends of Tables Band C are examples, but
discussion of such matters would be out of place here.
Some basic aspects of limiting distributions generally will be discussed in
Appendix B, primarily as background for Chap. 8.

Appendix A: Limits of the Binomial Distribution

A.l Ordinary Poisson Approximation and Limit

A random variable S has a Poisson distribution with mean m if


P(S = s) = m^s e^{−m}/s!    for s = 0, 1, ....    (9.1)
If S is binomial with parameters n and p, and n is large but np is moderate, then
S has approximately a Poisson distribution with mean m = np, and (9.1)
holds approximately. It follows that

P(s′ ≤ S ≤ s″) ≈ Σ_{s=s′}^{s″} m^s e^{−m}/s!,    (9.2)

where s′ and s″ are integers, 0 ≤ s′ ≤ s″, and ≈ denotes approximate equality
(in absolute, not relative terms). If n is large but n(1 − p) is moderate, then
n − S has approximately a Poisson distribution with mean m = n(1 − p).
This yields immediately an approximation to the distribution of S.
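
The quality of this approximation is easy to examine numerically. The sketch below compares exact binomial probabilities with the Poisson values for a case with large n and moderate np; it assumes SciPy, and the particular n and p are illustrative only.

```python
# Comparing exact binomial probabilities with the Poisson approximation
# (mean m = n*p) for large n and moderate np.  Assumes SciPy; the values of
# n and p are illustrative only.
from scipy.stats import binom, poisson

n, p = 200, 0.02                      # np = 4
for s in range(9):
    print(s, round(binom.pmf(s, n, p), 4), round(poisson.pmf(s, n * p), 4))
```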
Precise limit statements corresponding to these approximations are as
follows. Suppose Sₙ is binomial with parameters n and p, and p depends on n
in such a way that np → m as n → ∞. (The subscript has been added to S
because n is no longer fixed and the distribution of S depends on n.) Then

P(Sₙ = s) → m^s e^{−m}/s!    as n → ∞,  s = 0, 1, ...,    (9.3)

P(s′ ≤ Sₙ ≤ s″) → Σ_{s=s′}^{s″} m^s e^{−m}/s!    as n → ∞,  0 ≤ s′ ≤ s″ ≤ ∞.    (9.4)

In fact, for any set of integers A,


P(Sₙ ∈ A) → Σ_{s∈A} m^s e^{−m}/s!    as n → ∞.    (9.5)

Of course, these limit statements, although they have precise meanings, say
nothing about when the limits are approximately reached.
*The statements in (9.3) and (9.4) are easily proved as follows. The exact
probability function for Sn is

P(Sₙ = s) = (n choose s) p^s (1 − p)^{n−s}
          = (1/s!)[(n/n)((n − 1)/n) ⋯ ((n − s + 1)/n)](np)^s[(1 − p)^{1/p}]^{np−sp}.    (9.6)

As n → ∞, each factor in the first bracket approaches 1, np → m, and therefore
p → 0; it follows that (1 − p)^{1/p} → e^{−1} and sp → 0, which proves (9.3).
Summing (9.3) over s gives (9.4) for s″ < ∞. Limits cannot be taken under
infinite summation signs without further justification, but (9.4) now follows
for s″ = ∞ as well, since

P(s′ ≤ Sₙ) = 1 − P(Sₙ ≤ s′ − 1) → 1 − Σ_{s=0}^{s′−1} m^s e^{−m}/s! = Σ_{s=s′}^{∞} m^s e^{−m}/s!.    (9.7)

The proof of (9.5) will be given in Appendix B.*

A.2 Ordinary Normal Approximation and Limit

The density function of the standard normal distribution was given earlier in
(2.3) as

φ(z) = (1/√(2π)) e^{−z²/2}.    (9.8)

In this book the symbol Φ will be reserved to denote the cumulative distri-
bution function of the standard normal. Values of Φ(−z) are given in Table A
for z ≥ 0; by symmetry, Φ(z) = 1 − Φ(−z). A random variable X is normal
with mean μ and variance σ² if the standardized random variable, (X − μ)/σ,
has a standard normal distribution.
If S is binomial with parameters n and p, and np and n(1 − p) are both large
(that is, n is large and p is not too close to 0 or 1), then S is approximately

normal with mean μ = np and variance σ² = np(1 − p). In other words, the
standardized random variable

(S − μ)/σ = (S − np)/√(np(1 − p)) = ((S/n) − p)/√(p(1 − p)/n) = ((S/n) − p)√(n/(p(1 − p)))    (9.9)

is approximately standard normal.


Since the normal distribution is continuous while the binomial is discrete,
the approximation is generally improved by "correcting for continuity." This
amounts to associating the binomial probability at each integer s with a
continuous distribution on the interval between s − ½ and s + ½. Specifically,
letting Z denote a standard normal random variable, P(S = s) for an integer s
can be approximated by either

P((s − ½ − μ)/σ ≤ Z ≤ (s + ½ − μ)/σ) = Φ((s + ½ − μ)/σ) − Φ((s − ½ − μ)/σ)

or

(1/σ)φ((s − μ)/σ),    (9.10)

where μ = np, σ = √(np(1 − p)). Further, for any integers s′ and s″, s′ ≤ s″,
we have approximately

P(s′ ≤ S ≤ s″) ≈ P((s′ − ½ − μ)/σ ≤ Z ≤ (s″ + ½ − μ)/σ)
              = Φ((s″ + ½ − μ)/σ) − Φ((s′ − ½ − μ)/σ).    (9.11)

The values of Φ are given by Table A. This approximation can be recommen-
ded only for extremely casual use, where ease of remembering and calculating
are paramount. There is a normal approximation based on cube roots which
is only a little more complicated but an order of magnitude more accurate.
Another normal approximation based on logarithms is only slightly more
complicated and is yet another order of magnitude more accurate. The latter
is given at the end of Table B and the former is the basis of the approximate
confidence bounds given at the end of Table C.
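
The following sketch evaluates the simple approximation (9.11) with the continuity correction and compares it with the exact binomial probability; it assumes SciPy, and the particular numbers are illustrative only.

```python
# The normal approximation (9.11) with continuity correction, compared with
# the exact binomial probability.  Assumes SciPy; the numbers are
# illustrative only.
from scipy.stats import binom, norm

n, p, s1, s2 = 20, 0.4, 6, 10
mu, sigma = n * p, (n * p * (1 - p)) ** 0.5
approx = norm.cdf((s2 + 0.5 - mu) / sigma) - norm.cdf((s1 - 0.5 - mu) / sigma)
exact = binom.cdf(s2, n, p) - binom.cdf(s1 - 1, n, p)
print(round(approx, 4), round(exact, 4))
```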
Precise limit statements corresponding to the approximations (9.10) and
(9.11) are as follows. Suppose Sₙ is binomial with parameters n and p and p is
fixed, 0 < p < 1. If sₙ is an integer depending on n in such a way that (sₙ − μ)/σ
approaches some number, say z, then

P(Sₙ = sₙ)/[(1/σ)φ(z)] → 1    as n → ∞    (9.12)

and

P(Sₙ ≤ sₙ) → Φ(z)    as n → ∞.    (9.13)

Regardless of how s depends on n, we have

P(Sₙ ≤ s) − Φ((s − μ)/σ) → 0    as n → ∞.    (9.14)

Replacing s in the second term of (9.14) by s + ½ generally makes the left-hand
side nearer 0 without affecting the limit; subtracting (9.14) for s = s″ from
(9.14) for s = s′ − 1 then leads to (9.11). Again, the limit statements have
precise meanings but say nothing about when the limits are approximately
attained. Further, (9.14) does not imply that the ratio of the approximate and
true probabilities approaches one, and in fact it does not always.
PROOFS. The statement in (9.12) can be proved by straightforward but tedious
calculation based on Stirling's formula, which says that

n!/(√(2πn)(n/e)ⁿ) → 1    as n → ∞.

Equation (9.13) can be proved from (9.12) by proving that (9.12) holds
uniformly [Feller, 1968] or by another argument (see Appendix B), or, with-
out (9.12), by applying the Central Limit Theorem for identically distributed
random variables (Theorem 9.2). Equation (9.14) follows from (9.13)
(Problem 60). □

Appendix B: Convergence in Distribution and Asymptotic Distributions

In Appendix A, we showed that the Poisson and normal distributions may be


used as approximations to the binomial distribution. Now we will discuss
convergence of distributions and densities more generally.

B.1 Convergence of Frequency Functions and Densities

Suppose that we have a sequence of random variables, say {Xₙ}, and
a corresponding sequence of distributions. Suppose further that, as in the
previous section, the random variables Xₙ are discrete and, for every real
number x, the probability that Xₙ = x, say fₙ(x), approaches a limit, say
f(x), as n → ∞. Then as n → ∞

P(Xₙ = x) = fₙ(x) → f(x)    for all x.

It follows immediately that

P(Xₙ ∈ A) = Σ_{x∈A} fₙ(x) → Σ_{x∈A} f(x)    for any finite set A.

This is true for infinite sets as well if the limit f is a discrete frequency function.
Specifically, we must have Σₓ f(x) = 1, which cannot be taken for granted.
This added condition also holds when f is the Poisson frequency function, of
course, and the proof of (9.4) for s″ = ∞ used it.
An analogous result holds for densities. These facts and others are given
formally in the following theorem.

Theorem 9.1
(1) If Xₙ has discrete frequency function fₙ, n = 1, 2, ..., fₙ(x) → f(x) for all
x, and Σₓ f(x) = 1, then f is the discrete frequency function of a random
variable X, and

(i) Σₓ |fₙ(x) − f(x)| → 0    (9.15)

(ii) P(Xₙ ∈ A) = Σ_{x∈A} fₙ(x) → P(X ∈ A) = Σ_{x∈A} f(x)    (9.16)

for every set A, and

(iii) E[h(Xₙ)] = Σₓ h(x)fₙ(x) → E[h(X)] = Σₓ h(x)f(x)

for every bounded function h.

(2) If Xₙ has density fₙ, n = 1, 2, ..., fₙ(x) → f(x) for all x, and ∫ f = 1,
then f is the density of a random variable X and

(i′) ∫ |fₙ(x) − f(x)| dx → 0    (9.17)

(ii′) P(Xₙ ∈ A) = ∫_A fₙ(x) dx → P(X ∈ A) = ∫_A f(x) dx    (9.18)

for every set A, and

(iii′) E[h(Xₙ)] = ∫ h(x)fₙ(x) dx → E[h(X)] = ∫ h(x)f(x) dx

for every bounded function h.¹⁰


(3) In both (1) and (2), if Xₙ has cumulative distribution function Fₙ and X has
cumulative distribution function F, then

(iv) Fₙ(x) → F(x)    for all x.    (9.19)

PROOF. (i) implies (ii) uniformly in A and (iii) uniformly in h for |h| ≤ K, while
(ii) implies (iv). The same holds for (i′), (ii′), and (iii′). Hence we need prove
only (i) and (i′). Since

fₙ(x) + f(x) − |fₙ(x) − f(x)| ≥ 0    for all x,

10 Here, as elsewhere in this book, all sets and functions are assumed to be measurable without
specific mention.

it follows (Problem 61) that

Σₓ lim inf[fₙ(x) + f(x) − |fₙ(x) − f(x)|]
    ≤ lim inf Σₓ [fₙ(x) + f(x) − |fₙ(x) − f(x)|].¹¹    (9.20)

Since fₙ(x) → f(x) and Σₓ f(x) = 1, this reduces to

2 ≤ 2 − lim sup Σₓ |fₙ(x) − f(x)|    or    lim sup Σₓ |fₙ(x) − f(x)| ≤ 0.

The statement in (i) follows. The proof of (i′) is similar, using Fatou's Lemma,
which says that

∫ lim inf hₙ ≤ lim inf ∫ hₙ    for hₙ ≥ 0.

Actually, the densities may be with respect to an arbitrary measure. With this
understanding, the second case covers the first. (The method of proof used
here appears in a more natural context in Young [1911]. See also Pratt [1960].
A somewhat different proof is given by Scheffé [1947].) □

The foregoing discussion does not apply directly to the approach of the
binomial distribution to the normal, since the binomial distribution is discrete
while the normal has a density. The discussion does apply indirectly, however.
Specifically, suppose Sₙ is binomial with parameters n and p, and U is uni-
formly distributed between −½ and ½. Then Sₙ + U has a density; in fact, the
density of Sₙ + U is

gₙ(y) = P(Sₙ = s),    (9.21)

where s is the integer nearest y. (The definition for y half-way between adjacent
integers is immaterial.) This density approaches 0 as n → ∞ for y fixed, as
would be expected since the variance of Sₙ + U approaches ∞. Consider

Xₙ = (Sₙ + U − μ)/σ,    (9.22)

where μ = np and σ = √(np(1 − p)) as before. The density of Xₙ is

fₙ(x) = σ gₙ(μ + xσ) = σ P(Sₙ = s),    (9.23)

where s is the integer nearest μ + xσ. For this s, (s − μ)/σ → x, so that by
(9.12)

fₙ(x)/f(x) → 1    for all x,    (9.24)

11 The abbreviation inf stands for infimum, which is the greatest lower bound. Similarly, sup
denotes supremum or least upper bound. The infimum and supremum of a set of numbers always
exist; either or both may be infinite.

where f is the standard normal density. Thus fₙ(x) → f(x) for all x, and
Theorem 9.1 applies.
It follows in particular that

Fₙ(x) = P(Xₙ ≤ x) → F(x)    for all x,    (9.25)

and hence (Problem 60) that (9.13) holds.

B.2 Convergence in Distribution

Suppose X₁, X₂, ..., and X are real- or vector-valued random variables with
respective cumulative distribution functions F₁, F₂, ..., and F. Then the
following conditions are equivalent (Problem 62):

(1) Fₙ(x) → F(x) for every x at which F is continuous.
(2) E[h(Xₙ)] → E[h(X)] for every bounded, continuous function h.
(3) P(Xₙ ∈ A) → P(X ∈ A) for every set A such that P(X is a boundary point
of A) = 0.
(4) lim inf P(Xₙ ∈ A) ≥ P(X ∈ A) for every open set A.
(5) lim sup P(Xₙ ∈ A) ≤ P(X ∈ A) for every closed set A.

If these conditions hold (if one holds, they all do, since they are equivalent),
then Fn is said to converge in distribution to F. Alternative terminology is that
Xₙ converges in distribution to X, or Xₙ to F, or Fₙ to X; Xₙ or Fₙ is asympto-
tically distributed as X or F; F is the limiting distribution of Xₙ or Fₙ; etc.
Part of the definition of convergence in distribution is that the limit F
should be a cumulative distribution function. It is possible for (1) to hold
without F being a cumulative distribution function (Problem 63); then Fₙ
does not converge in distribution to F (though the customary terminology is
to say that it converges "weakly" to F). If (1) holds, F must satisfy the mono-
tonicity properties of a cumulative distribution function; thus the further
requirement is just that it should behave properly at ±∞, which amounts to
Conditions (1), (2), and (3) above are somewhat weaker than the cor-
responding statements of Theorem 9.1. Thus the hypotheses of Theorem 9.1
imply convergence in distribution, while convergence in distribution does not
imply the conclusions (or hypotheses) of Theorem 9.1, even if all the distribu-
tions are discrete. (It does if all the distributions are concentrated on the same
finite set of points. The proof is Problem 64.)
Notice also that the convergence in distribution of X n to X does not imply
that X n is probably near X for n large. Xl, X 2, •.. , X might be independent;
indeed, their joint distribution is not under discussion and need not exist. The
convergence in distribution of Xn to X says only that the distribution of Xn
is close to the distribution of X for n large, in a certain sense of close.

The most important limiting distribution in practice is the normal. A


special terminology is convenient and often used, but must be handled with
care, especially in mathematical arguments. A sequence of real random
variables Zₙ is said to be asymptotically normal with mean μₙ and standard
deviation σₙ (or variance σₙ²) if 0 < σₙ < ∞ and (Zₙ − μₙ)/σₙ has asymptotical-
ly a standard normal distribution. This terminology suggests approximating
the distribution of Zₙ by the normal distribution with mean μₙ and standard
deviation σₙ. This is reasonable; it is the same as approximating the distribu-
tion of (Zₙ − μₙ)/σₙ by the standard normal distribution, which is what the
definition suggests, and what would have to be done anyway to make the
standard normal tables apply. In mathematical arguments, however, the use
of the special terminology can be misleading, because it suggests that the
normal distribution with mean μₙ and standard deviation σₙ is in some way a
limit, which of course is not true since it depends on n. Furthermore, if a
parameter is involved and σₙ is of smaller order for some values of the param-
eter than others, usual statements and proofs may not apply to these values.
If X is normal, then P(X is a boundary point of A) = 0 for ordinary sets A.
Then, if Xₙ converges in distribution to X, by (3), we have P(Xₙ ∈ A) →
P(X ∈ A) for ordinary sets A.
In particular, let A be the set of integers. Then P(X ∈ A) = 0 for any
normal X. If Sₙ is binomial, P(Sₙ ∈ A) = 1, so that P(Sₙ ∈ A) does not approach
0. This appears to contradict either (3) or the statement that Sₙ is asymptotical-
ly normal, but it actually does not. The point is that Sₙ does not converge in
distribution, but rather that (Sₙ − μₙ)/σₙ does for suitable μₙ and σₙ; hence (3)
applies to (Sₙ − μₙ)/σₙ rather than to Sₙ. This illustrates one way the special
terminology can mislead.

B.3 Two Central Limit Theorems

The following theorems give two frequently applicable, convenient conditions


for asymptotic normality. (Proofs are given in such texts as Cramér [1946],
Doob [1953], Loève [1955], Fisz [1963].)

Theorem 9.2 (Central Limit Theorem for independent, identically distributed,
real random variables). If X₁, X₂, ... are independent observations on a
distribution with finite mean μ and finite variance σ², then

(Σⱼ₌₁ⁿ Xⱼ − nμ)/(σ√n)    (9.26)

converges in distribution to the standard normal distribution; that is, Σⱼ₌₁ⁿ Xⱼ
is asymptotically normal with mean nμ and variance nσ², and X̄ = Σⱼ₌₁ⁿ Xⱼ/n is
asymptotically normal with mean μ and variance σ²/n.

Theorem 9.3 (Liapounov Central Limit Theorem). If X₁, X₂, ... are inde-
pendent real random variables with possibly different distributions, each having
finite absolute moments of the order 2 + δ for some number δ > 0, and if

Σⱼ₌₁ⁿ E(|Xⱼ − μⱼ|^{2+δ}) / (Σⱼ₌₁ⁿ σⱼ²)^{1+δ/2} → 0    as n → ∞,    (9.27)

where μⱼ = E(Xⱼ) and σⱼ² = var(Xⱼ), then

Σⱼ₌₁ⁿ (Xⱼ − μⱼ) / (Σⱼ₌₁ⁿ σⱼ²)^{1/2}    (9.28)

converges in distribution to the standard normal distribution; that is, Σⱼ₌₁ⁿ Xⱼ
is asymptotically normal with mean Σⱼ₌₁ⁿ μⱼ and variance Σⱼ₌₁ⁿ σⱼ², and X̄ =
Σⱼ₌₁ⁿ Xⱼ/n is asymptotically normal with mean Σⱼ₌₁ⁿ μⱼ/n and variance Σⱼ₌₁ⁿ σⱼ²/n².
Both of these theorems apply to the binomial distribution, since S can be
represented as the sum of n random variables defined by Xj = 1 if the jth
trial is a success, and Xj = 0 otherwise (Problem 65).
Actually, Theorem 9.3 applies whenever Theorem 9.2 does if the moment of
order 2 + δ is finite. In this case, for all j we have σⱼ² = σ² and E(|Xⱼ − μⱼ|^{2+δ})
= c, say, so that the left-hand side of (9.27) becomes

nc/(nσ²)^{1+δ/2} = c/(σ^{2+δ}n^{δ/2}),    (9.29)

which indeed approaches 0 as n → ∞. This illustrates the fact that the left-hand
side of (9.27) has a tendency to approach 0 at the rate n^{−δ/2} for some δ > 0, so
the absolute moments E(|Xⱼ − μⱼ|^{2+δ}) must misbehave quite badly before
(9.27) will fail.
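
A small simulation makes the binomial case of Theorem 9.2 concrete. The sketch below checks that the standardized binomial count behaves approximately like a standard normal variable; it assumes NumPy, and the parameters and seed are illustrative only.

```python
# Simulation check of the Central Limit Theorem for the binomial case:
# the standardized count (S - np)/sqrt(np(1-p)) is approximately standard
# normal for large n.  Assumes NumPy; parameters are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n, p, reps = 400, 0.3, 10_000
s = rng.binomial(n, p, size=reps)           # each S is a sum of n indicators
z = (s - n * p) / np.sqrt(n * p * (1 - p))
print(np.mean(z <= 1.645))                  # should be near Phi(1.645) ~ 0.95
```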

PROBLEMS

1. Show that for any estimator Tₙ(X₁, ..., Xₙ) of a parameter θ based on a sample of
size n, if limₙ→∞ E(Tₙ) = θ and limₙ→∞ var(Tₙ) = 0, then Tₙ is consistent for θ, that
is, (3.5) holds.
2. Show that, for n Bernoulli trials, the probability that s successes occur on s specified
trials is the same regardless of which s of the n trials are designated as successes.
3. Complete the proof of the equivalence of the three conditions for sufficiency given
in Sect. 3.3
(a) for the binomial distribution.
(b) for an arbitrary frequency function.
(c) for an arbitrary density function.
4. (a) Show that, if S is binomial with parameters nand p, then S(n - S)/[n(n - 1)]
is a minimum variance unbiased estimator of p(1 - p).
(b) Show that max p(1 − p) = ¼.
(c) Show that the maximum value of the estimator in (a) exceeds ¼ by 1/4n for n
odd, and by 1/(4n − 4) for n even.

5. (a) Show that the mean squared error of an estimator T of a parameter θ can be
reduced to 0 for a particular value of θ by suitable choice of T.
*(b) Suppose the same estimator has mean squared error 0 for two different values
of θ. What unusual condition would follow for the distribution of the obser-
vations?
*6. (a) Show that the nonrandomized estimator T(S) defined in Sect. 3.4 has the same
mean as the randomized estimator Ys , and smaller variance unless they are
equal with probability one.
(b) Generalize (a) to any statistic S, sufficient or not, in any problem, and to any
loss function v(y, θ) which is convex in the estimator y for each value of the
parameter θ.
*7. Let X be any random variable. Show that
(a) The points of positive probability are at most countable.
(b) There exists a finite or countably infinite set A such that P(X ∈ A) = 1 if and
only if the points of positive probability account for all of the probability.
(Either of these conditions could therefore be used to define "discrete" for
random variables or distributions.)
8. Give (real or hypothetical) examples of results of tests of binomial hypotheses
which are
(a) not statistically significant but are apparently practically significant.
(b) statistically significant but are practically not significant.
9. Show that if a test of hypothesis has level 0.05, then it also has level 0.10.
10. Graph the power of the upper-tailed binomial test at level 0.10 of H₀: p ≤ 0.6 in
the case where n = 10.
11. Show that any two tests which are equivalent have the same exact level and the
same power against all alternatives, but not conversely.
12. Show that two test statistics are equivalent if they are strictly monotonically
related.
13. Suppose that the rejection region of a lower-tailed binomial test is S ≤ s_c, and let
α(p) = P_p(S ≤ s_c). Show that as p increases, α(p) decreases for any fixed n and s_c,
so that α(p₀) > α(p) for any p > p₀ and α(p₀) < α(p) for any p < p₀. Hence, if the
null hypothesis is H₀: p ≥ p₀ and s_c is chosen in such a way that α(p₀) = α, then α is
the maximum probability of a Type I error for all null distributions, and 1 − α is the
maximum probability of a Type II error for all alternative distributions, and both
these maxima are achieved for the "least favorable case" p = p₀.
14. Suppose that 1 success is observed in a sequence of 6 Bernoulli trials. Is the sample
result significant for the one-sided test of H₀: p ≥ 0.75
(a) at the 0.10 level?
(b) at the 0.05 level?
(c) at the 0.01 level?
What is the level of just significance (critical level)?
15. Let p be the proportion of defective items in a certain manufacturing process. The
hypothesis p = 0.10 is to be tested against the alternative p > 0.10 by the following
procedure in a sample of size 10. "If there is no defective, the null hypothesis is

accepted; If there are two or more defectives the null hypothesis is rejected; if there
is one defective, another sample of size 5 is taken. In this latter situation, the null
hypothesis is accepted if there is no defective in the second sample, and it is rejected
otherwise."
(a) Find the exact probability of a Type I error for this test procedure.
(b) Find the power of the test for the alternative distribution where p = 0.20.
(c) Graph the power curve as a function of p.

16. A manufacturing process ordinarily produces items at the rate of 5% defective. The
process is considered "in control" if the percent defective does not exceed 10%.
(a) Find a procedure for testing H₀: p ≤ 0.05 for a sample of size 20 and a signifi-
cance level of 0.05.
(b) Find the power of this procedure when p = 0.10.
(c) A sample of size n is to be drawn to test the null hypothesis Ho: p = 0.05 against
the alternative p > 0.05. Determine n so that the level is 0.10 and the power
against p = 0.10 is 0.30.

17. Let p be the true proportion of voters who favor a school bond. Suppose we use a
sample of size 100 to test H₀: p ≤ 0.50 against the alternative H₁: p > 0.50, and 44
are in favor. Find the P-value. Does the test "accept" or reject at level 0.10?

*18. A particular genetic trait occurs in all individuals in a certain population with
probability either or !. It is desired to determine which probability applies to this
population. If a sample of 400 is to be taken, construct a test at level 0.01 for the null
hypothesis Ho: P = i·
(a) Find the power of this test.
(b) If 60 individuals in the sample have this genetic trait, what decision would you
reach?

19. (a) Suppose that k individual tests of a null hypothesis H₀ are given, and their
respective exact levels are α₁, ..., α_k. Show that the combined test, which re-
jects H₀ if and only if at least one of the given tests rejects H₀, has exact
level α ≤ α* where α* = α₁ + ⋯ + α_k.
(b) Under what circumstances can we have α = α*?
*(c) If the individual tests are independent under H₀, show that α*/(1 + α*) <
α ≤ α*. (This implies, for instance, that α* does not overstate α by more than
10% if α ≤ 0.10.) (Hint: Show that 1 − α = ∏ᵢ₌₁ᵏ (1 − αᵢ) and 1 + α* ≤
∏ᵢ₌₁ᵏ (1 + αᵢ), and multiply.)
(d) If the individual tests have possibly different null hypotheses H₀ᵢ, i = 1, ..., k,
show that (a) applies with H₀ = ∩ᵢ₌₁ᵏ H₀ᵢ, the intersection of the H₀ᵢ.

20. Define a two-tailed test T(α) at level α by combining two conservative one-tailed
tests at level α/2, as at (4.7) and (4.8). Let α*(α) be the exact level of this two-tailed
test.
(a) Show that α*(α) ≤ α.
(b) Under what circumstances will we have α*(α) = α? Does it matter whether
the null hypothesis is simple? Does it matter whether the null distribution is
continuous or symmetric?
(c) Under what circumstances will T[α*(α)] = T(α)?
(d) Under what circumstances will α*[α*(α)] = α*(α)?

21. Consider a two-tailed test for which "extreme" is defined by the one-tailed P-value
of the test statistic. Assume the test statistic has a unique null distribution.
(a) If this distribution is continuous, show that the two-tailed P-value is twice the
one-tailed P-value.
(b) Derive the two-tailed P-value (described in the text) in the discrete case.
(c) Apply this definition to all possible outcomes in Table 4.1.
(d) In Table 4.1, what test corresponds to this definition and what exact levels are
possible?
22. Define a two-tailed P-value as the one-tailed P-value plus the nearest attainable
probability from the other tail.
(a) Apply this definition to all possible outcomes in Table 4.1.
(b) In Table 4.1, what test corresponds to this definition and what exact levels are
possible?
(c) Compare these results with (c) and (d) of Problem 21.
(d) Apply this definition to all possible outcomes for the following null distribution.

s           1      2      3      4      5      6

P(S = s)    0.13   0.22   0.14   0.08   0.18   0.25

(e) Show that the P-values in (d) correspond to no test.


(f) Invent a less pathological example, preferably unimodal, with the same
property.
23. In a discrete case such as the binomial, show that the two-tailed P-value is a
discontinuous function of the null hypothesis for all the procedures of Sect. 4.5
except doubling the one-tailed P-value.
24. Construct a table similar to Table 4.2 for the probabilities of erroneous conclusions
in the three-conclusion, two-tailed binomial tests defined below.
(a) Observed:    S ≤ s_l        S ≥ s_u        s_l < S < s_u
    Conclusion:  p < p₀         p > p₀         no conclusion
(b) Observed:    S ≤ s_l        S ≥ s_u        s_l < S < s_u
    Conclusion:  p ≤ p₀         p ≥ p₀         no conclusion
25. In a binomial situation with H₀: p = p₀, let s′₁, s₁, s₂, s′₂ be critical values at
exact levels α′₁, α₁, α₂, and α′₂ respectively, where α′₁ < α₁ and α′₂ < α₂. Make a
table like Table 4.2 for the test procedure below, and show that the maximum
probability of error is max{α₁, α₂, α′₁ + α′₂}.

Conclusion:  p < p₀    p ≤ p₀    p ≥ p₀    p > p₀    no conclusion


26. Show algebraically that when n = 6 the power of the binomial test which rejects
H₀: p ≥ 0.5 when S = 1 only is greater than the power of the test which rejects
when S = 0 only, for all p > 1/7.
27. Show that a lower-tailed randomized test with φ_c = 0 or 1 for critical value s_c is
equivalent to a lower-tailed nonrandomized test with critical value s_c − 1 or s_c
respectively.

28. Show that, for any exact level α, 0 ≤ α ≤ 1, there is exactly one lower-tailed
randomized test based on a given statistic S if the distribution of S is uniquely
determined under H₀ and all tests with the same critical function are considered
the same.
29. Show that when n = 6, the binomial test which rejects either if there are no successes,
or if there is just one success and it does not occur on the first trial, has exactly the
same conditional probability of rejection given S for every p as does the lower-tailed
randomized test based on S which rejects always if there are no successes and with
probability 5/6 if there is one success, whichever trial it occurs on. Hence, in particular, it
has the same exact level (α = 3/32 for p₀ = 0.5) and the same power.
30. Show that for n = 6, H₀: p ≥ 0.5, the most powerful (randomized) test at level α,
for 1/64 < α < 7/64, is to reject always when no success is observed and with prob-
ability (64α − 1)/6 when 1 success is observed.
31. Consider the binomial problem of testing the simple null hypothesis Ho: p = 0.5
against the simple alternative p = 0.3 when n = 6 using a lower-tailed test based on
S, the observed number of successes in n trials. If we restrict consideration to non-
randomized tests, there are 7 different critical values of S, and hence only 7 different
exact levels α possible. For each of these, the corresponding probability of a Type II
error β is easily found from Table B.
(a) Plot these 7 pairs of values (α, β) on a graph to see how α and β interact.
(b) If randomized tests are allowed, any exact α can be obtained. Find the ran-
domized tests for some arbitrary values of α in between the exact values, and
compute the corresponding values of β. Plot these points on the graph in (a).
(c) Show that the points in (b) lie on the straight line segments which connect
successive points in (a). Complete the (α, β) graph for randomized tests. If the
nominal level is 0.10, the graph provides strong support for not choosing a
conservative (nonrandomized) test in this situation, while if the nominal
level is 0.05, the graph provides some support for using randomized tests in this
case. What if the nominal level is 0.20?

*32. Consider a one-tailed test based on a statistic S whose distribution is uniquely
determined under H₀. Show that the following hold under H₀:
(a) If S is continuous, the P-value is uniformly distributed over (0, 1).
(b) If S is discrete, the c.d.f. of the P-value of a conservative test nowhere exceeds (is
"stochastically smaller" than) that of the uniform distribution.
(c) If S is discrete and a randomized test is carried out by rejecting for U < φ(x)
where U is uniform on (0, 1), then the corresponding P-value is uniformly
distributed between the exact P-value and the next P-value.
(d) This randomized P-value is uniformly distributed over (0, 1).
*33. Let P′ be the expectation of the P-value under a simple (or least favorable) null
hypothesis. Then P′ − ½ is one possible measure of how conservative a test is.
Evaluate this for the situation of Table 4.1 (binomial, n = 10, H₀: p = 0.6), for the
three procedures discussed there.

34. Prove that if a confidence region has confidence level 0.99, then it also has level 0.95.
35. Verify the numerical values of the upper and lower 90% confidence limits shown in
Table 6.1.

36. How do the regions A(θ₀) and C(S) of Sect. 6.2 relate geometrically to the Clopper-
Pearson charts?

37. Show that binomial confidence limits satisfy p_L(s) = 1 − p_U(n − s).

38. Suppose that no successes occur in n Bernoulli trials with probability of success p.
Find the one-sided upper confidence limit for p at an arbitrary level α using (a) the
binomial distribution, (b) the Poisson approximation, and (c) the normal approxi-
mation. For n = 4, graph the upper limit as a function of α for each of the procedures
(a), (b), and (c).

39. One of the large automobile manufacturers has received many complaints con-
cerning brake failure in one of their current models. The cause was traced to a
factory-defective part. This same part was found defective in six out of a group of
sixteen cars inspected; these six cars were designated "unsafe."
(a) Test the hypothesis that if this model is recalled for inspection, no more than
10% in this population will be designated "unsafe."
(b) Find an upper confidence bound for the proportion "unsafe," with a level of
0.95.
(c) Use the large-sample method to find an approximate upper 95 % confidence
bound.
(d) Find a two-sided 95 % confidence interval for the proportion of cars without
the defective part.
(e) What inference procedure seems most helpful to the company managers and
why?
(f) Which assumption for the binomial model is likely not to be satisfied in this
example?

40. Show that the confidence region corresponding to the usual two-tailed binomial
test (defined in Sect. 4.5) is an interval and that its endpoints are the confidence
bounds (defined in Sect. 6.1) at level 1 − (α/2).

41. Verify the result stated in (6.11) concerning the expected length of a confidence
interval.

*42. Let C(S) be a confidence region for a parameter θ, and let V(S) = ∫_{C(S)} dθ be the size
of C(S). Denote by Q(θ) the probability that C(S) includes θ under any fixed
distribution of S, that is, Q(θ) = P[θ ∈ C(S)]. Show that the expected size is
E[V(S)] = ∫ Q(θ) dθ. (Hint: This is just a change of order of integration in disguise.)

*43. What happens in Problem 42 if only a portion of the range of θ is considered
(e.g., θ > θ₀ for θ one-dimensional)?

*44. For the randomized, lower-tailed binomial test of p at exact level α, show that the
probability of "acceptance" when s is observed is a(p, s) if 0 ≤ a(p, s) ≤ 1, is 1
if a(p, s) ≥ 1, and is 0 if a(p, s) ≤ 0, where
*45. Show that a(p, s) as defined in Problem 44 is a decreasing function of p for fixed s.

*46. The table below gives the usual upper 90% confidence limits and some alternate
limits for a binomial parameter p when n = 6. Note that the alternate limit is larger
for S = 0 than for S = 1.
(a) Show that the alternate procedure has confidence level 0.90.
(b) Show that, when the true p = 0.5, the alternate limits have smaller probability
of exceeding p₀ than the usual limits for 0.500 < p₀ < 0.510 and the same
probability for p₀ > 0.510.
(c) Show that the alternate limits have smaller expected "excess" when p = 0.5.
(d) What happens in (b) and (c) for other values of p?
(e) Show that the "acceptance" region for H₀: p = p₀ corresponding to the
alternate procedure is not an interval for 0.500 < p₀ < 0.510.

s                   0      1      2      3      4      5      6

Usual Limit         0.319  0.510  0.667  0.799  0.907  0.983  1.00

Alternate Limit     0.510  0.492  0.667  0.799  0.907  0.983  1.00

*47. Consider the family of nonrandomized, lower-tailed binomial tests which, for each
null hypothesis value p, have exact level as near α as possible. Show that the cor-
responding confidence region is an interval.

*48. Using the facts stated in the first paragraph of Sect. 7.4, show that for a null hypo-
thesis p = p₀ or p ≥ p₀
(a) a randomized, lower-tailed binomial test
(i) is uniformly most powerful at its exact level;
(ii) uniformly minimizes the probability of rejection for p > p₀ among tests
as powerful at p₁ for any p₁ < p₀;
(iii) is admissible;
(b) the class of all randomized, lower-tailed binomial tests is complete.

*49. (a) Show that, under any true value p₀, the usual, nonrandomized upper con-
fidence bound for the binomial parameter p uniformly minimizes both
(i) the probability of exceeding values of p′ > p₀,
and
(ii) the expected "excess" over p₀,
among upper confidence bounds having no greater probability of falling below
any true value p₀.
(b) Show that the randomized upper confidence bound at exact level α for all p has
the properties stated in (a).
(c) Prove a similar result for any confidence procedure corresponding to a family of
one-tailed binomial tests.

50. (a) Show that there is one and only one unbiased, two-tailed test for a given
binomial null hypothesis p = p₀ (0 < p₀ < 1) at a given level α.
(b) Show that this test is not equal-tailed in general.
(c) Show that, in general, it is randomized at both critical values.
(d) Show that, in general, even adjusting α will not make the test nonrandomized.

51. Show that a one-tailed binomial test is unbiased, and hence that a one-sided
confidence procedure is also unbiased.

52. Show that a minimum likelihood test based on a sufficient statistic S having a
unique null distribution is most powerful against the alternative that S is uniformly
distributed, among tests at the same exact level.
53. Suppose that an unbiased test of the binomial null hypothesis p = p₀ is given for
each p₀, but the exact level α(p₀) varies with p₀. What property related to unbiased-
ness does the corresponding confidence procedure have?
54. Show that if p is distributed uniformly over (0, 1) and, for given p, S is binomial with
parameters n and p, then the marginal distribution of S is discrete uniform on the
integers 0, 1, 2, ..., n. (For a generalization, see Raiffa and Schlaifer [1961],
pp. 237-241.)

55. (Continuation of Problem 54) Show that the average power of a nonrandomized
test of a binomial null hypothesis p = Po, that is, the integral of the power curve
over p, equals the number of possible values of S which are in the rejection region
divided by n + 1.
56. Demonstrate formula (8.2) for the average expected length of a binomial confidence
region.

*57. Generalize formula (8.2) to binomial confidence regions which are not necessarily
intervals.

58. (a) Suppose we have a nonrandomized confidence procedure for a binomial
parameter p such that the region is always an interval. Show that we have
Σ_s L(s) = ∫₀¹ N(p) dp, where L(s) is the length of the confidence region for S = s
and N(p) is the number of values of S for which the confidence region includes p.
(b) Generalize (a) to regions which are not necessarily intervals.
*59. Show that any two-tailed binomial test whose upper and lower critical values are
not equal is admissible in the three-conclusion interpretation with risk function (8.4),
under the assumptions (8.5)-(8.7) and the assumption that (p/q)[L_b(p) − L_c(p)]/
[L_a(p) − L_b(p)] either approaches 0 as p → 0 or approaches ∞ as p → 1. Hint:
Show that any such test is Bayes against some three-point prior distribution of p,
and is unique Bayes except at the critical values.
60. Show that if P(S_n ≤ s_n) → Φ(z) whenever (s_n − μ_n)/σ_n → z, as in (9.13), then
P(S_n ≤ s_n) − Φ[(s_n − μ_n)/σ_n] → 0 regardless of how s_n depends on n, as in (9.14).
61. Show that if g_n(x) ≥ 0 for all x in some countable set B, then as n → ∞,
Σ_{x∈B} lim inf g_n(x) ≤ lim inf Σ_{x∈B} g_n(x).

62. Show the equivalence of the conditions (1)-(5) for convergence in distribution given
at the beginning of Sect. 8.2.

63. (a) If X_n is normal with mean 0 and variance n, what is lim F_n(x)?
(b) If F is nondecreasing on (−∞, ∞) and 0 ≤ F ≤ 1, construct a sequence of
c.d.f.'s F_n such that F_n(x) → F(x) for every x at which F is continuous.
64. (a) Show that if a sequence of distributions on a finite set converges in distribution,
then the conditions of Theorem 9.1(1) hold.
(b) Give a counterexample for a countably infinite set.

65. Apply the Central Limit Theorems 9.2 and 9.3 to the binomial distribution.
CHAPTER 2

One-Sample and Paired-Sample Inferences Based on the Binomial Distribution

1 Introduction
The goal of statistical inference procedures is to use sample data to obtain
information, albeit uncertain, about some larger population or data-
generating process. The inferences may concern any aspect of a suitably
defined population (or process) from which observations are obtained, for
example, the form or shape of the probability distribution of some variable
in the population, or any definable properties, characteristics or parameters
of that distribution, or a comparison of some related aspects of two or more
populations. Procedures are usually classified as nonparametric when some
of their important properties hold even if only very general assumptions are
made or hypothesized about the probability distribution of the observations.
The word "distribution-free" is also frequently used in this context. We will
not attempt to give an exact definition of "nonparametric" now or later, as it
is only this general spirit, rather than any exact definition, which underlies
the topics covered in this book.
In order to perform an inference in one-sample (or paired sample)
problems using the methodology of parametric statistics, information about
the specific form of the population must be postulated throughout or in-
corporated into the null hypothesis. The traditional parametric procedure
then either postulates or hypothesizes a specific form of population, often the
normal, and the inferences concern some population parameters, typically
the mean or variance or both. The exact distribution theory of the statistic,
and hence the probabilities of both types of errors in testing and the confi-
dence level in estimation, depend on this population form. Such inference
procedures may or may not be highly sensitive to the population form. If they


are not, the procedure is said to be "robust." Robustness has been extensively
studied. (See, for instance Bradley [1968, pp. 28-40] and references given
there.)
A nonparametric procedure is specifically designed so that only very
general characteristics of the relevant populations need be postulated or
hypothesized, for example, that a distribution is symmetric about some
specified point. The inference is then applicable to, and completely valid in,
quite general situations. In the one-sample case, the inference concerns some
definable property or aspect of the one population. For example, if symmetry
is assumed, such an inference might concern the true value of the center of
symmetry and its exact level may be the same for all symmetric populations.
Symmetry is a much less restrictive assumption than normality. Alternatively,
the inference may be an estimate or hypothesis test of the value of some other
parameter in a general population. In short, a nonparametric procedure is
designed so as to be perfectly robust in certain respects (usually the exact
significance or confidence level) under some very general assumptions.
The remainder of this book will consider various situations where in-
ferences can be made using nonparametric procedures, rather than studying
post hoc the robustness of parametrically derived procedures. In this chapter
the inferences will be based on the binomial probability distribution; how-
ever, they are valid for observations from general populations. The first type
of inference to be covered relates to the value of a population percentile point
like the median, the first quartile, etc. For data consisting of matched pairs
of measurements, the same procedures are applicable for inferences con-
cerning the population of differences of pairs. If the matched pair data are
classificatory rather than quantitative, for example classified as either success
or failure, inferences about the differences of pairs can also be made by similar
procedures, but they merit separate discussion. Finally, we will discuss one-
sample procedures for setting tolerance limits for the distribution from which
observations are obtained.

2 Quantile Values
Most people are familiar with the terms percentiles, median, quartiles, etc.,
when used in relation to measurement data, for example, in reports of test
scores. These are points which divide the measurements into two parts, with
a specified percentage on either side. If a real random variable X has a
continuous distribution with a positive density, the statement that the median
or fiftieth percentile is equal to 10 means that 10 is the point having exactly
one-half of the distribution of X below it and one-half above. This statement
can be expressed in terms of probability as

P(X < 10) = 0.5, P(X > 10) = 0.5.



A similar probability statement could be made for other parameters of this
type. For example, 10 is the 75th percentile point if P(X < 10) = 0.75,
P(X > 10) = 0.25.
While such a definition of these order parameters is simple to understand
and interpret, it is fully satisfactory only in those cases where the point so
defined exists and is a unique number. In order to take care of all cases, we
give the following more explicit definition. For any number p, where
0 ≤ p ≤ 1, the pth quantile, or the quantile of order p, of the distribution of
X is any value ξ_p which satisfies

P(X < ξ_p) ≤ p ≤ P(X ≤ ξ_p).    (2.1)

(This parameter may also be called the pth fractile or fractile of order p, the
p-point, or the (100p)th percentile point.) Equation (2.1) says that the
probability to the left of ξ_p is at most p if ξ_p is excluded, and at least p if ξ_p is
included. An equivalent statement is that if ξ_p is excluded, the probability
to the left is at most p and the probability to the right is at most 1 − p.
We now investigate more specifically the implications of the definition in
(2.1) for various types of distributions. We shall see that (a) ξ_p may be unique
and belong to a unique p, (b) the possible values of ξ_p for given p may form
a closed interval, or (c) ξ_p may have the same value for a closed interval of
values of p. Thus p and ξ_p have a one-to-one relationship only in case (a).
The three types of situation which lead to these possibilities are explained
below and are illustrated in Fig. 2.1 as (a), (b) and (c) respectively.
(a) The simple case is where X has a strictly increasing, continuous c.d.f.,
as holds if it has a positive density. Then ξ_p exists, is unique and applies to

Figure 2.1 [Graphs of the c.d.f. F(x) against x illustrating cases (a), (b), and (c).]

just one value of p. Thus there is a one-to-one relationship between p and ξ_p,
and given either p or ξ_p, the other can be found as that unique number which
satisfies F(ξ_p) = p.
(b) Suppose that F is not strictly increasing, that is, F is constant in some
interval of positive length. If the value of F in the interval is p, every point in
the interior of this interval has probability p to the left of it (and 1 - p to the
right) and hence is a possible value of ξ_p. The endpoints of the interval also
satisfy (2.1) (Problem 1). Thus the pth quantiles of the distribution form a
closed interval of positive length. This nonuniqueness can easily be removed
by some arbitrary convention, but this does not reduce the difficulties in
inference procedures relating to a nonunique ξ_p. The most typical convention
is to call the midpoint of the interval of values "the" pth quantile. We shall
not follow this convention, although sometimes, as we have already done, we
take the liberty of saying "the" pth quantile when "a" pth quantile would be
more accurate.
(c) Suppose that F is discontinuous, as for a discrete distribution. At each
discontinuity point ξ there must be a jump in F. Then ξ is a pth quantile for
a whole interval (endpoints included) of values of p, and this interval has
positive length since, by (2.1), a given value ξ is the pth quantile for any p
satisfying

P(X < ξ) ≤ p ≤ P(X ≤ ξ).    (2.2)
This discussion shows that by our definition in (2.1), for any p there is at
least one pth quantile ξ_p, and any ξ is a pth quantile for at least one value of p.
In any case, the set of possible combinations of p and ξ_p is given by the c.d.f.
with any vertical jumps filled in (Problem 2). Of course, a c.d.f. may exhibit
more than one of (a)-(c) (Problem 4).
Since any quantile ξ_p is a population parameter, point estimates, tests of
hypotheses, and confidence intervals for ξ_p are all of interest. The natural
point estimate of ξ_p is a sample quantile of the same order p, but a precise
definition of a sample quantile must be arbitrary if a unique value is desired.
These problems do not arise if a confidence interval for ξ_p is obtained, and
confidence intervals provide more information and are more useful in most
situations anyway. For the most part, discussion of inferences about quantiles
in this book will be restricted to confidence intervals and tests of hypotheses.
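As a purely illustrative aside (ours, not the text's; the function name and the example probabilities below are arbitrary), definition (2.1) can be applied mechanically to a discrete distribution. The short Python sketch lists the support points that qualify as pth quantiles; in case (b) the whole closed interval between two qualifying support points also qualifies.

import numpy as np

def pth_quantiles(support, probs, p):
    """Support points x satisfying (2.1): P(X < x) <= p <= P(X <= x)."""
    support = np.asarray(support, dtype=float)
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(support)
    support, probs = support[order], probs[order]
    cdf = np.cumsum(probs)          # P(X <= x) at each support point
    cdf_left = cdf - probs          # P(X < x) at each support point
    return support[(cdf_left <= p) & (p <= cdf)]

# Distribution with P(X=1)=0.3, P(X=2)=0.4, P(X=3)=0.3 (arbitrary numbers):
print(pth_quantiles([1, 2, 3], [0.3, 0.4, 0.3], p=0.5))   # [2.]     case (a): unique
print(pth_quantiles([1, 2, 3], [0.3, 0.4, 0.3], p=0.3))   # [1. 2.]  case (b): an interval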

3 The One-Sample Sign Test for Quantile Values


3.1 Test Procedures

Let X₁, ..., X_n be n independent observations drawn from the same distri-
bution, and suppose we wish to test the null hypothesis that the median of
this distribution is 0. Let us suppose that the point 0 does not have positive
probability, that is, assume that P(X_j = 0) = 0. (The contrary case is more

complicated and will be discussed in Sect. 6.) Then with probability 1, each
observation is either positive (>0) or negative (<0), and P(X_j < 0) =
P(X_j > 0) = 0.5 under the null hypothesis. We could then test the null
hypothesis by counting the number S of negative (or positive) observations.
Under the null hypothesis, S is binomial with parameter p = 0.5, and the
tests discussed in Chap. 1 for this hypothesis may be applied to S as defined
here.
An upper-tailed test, rejecting when S is larger than some critical value,
that is, when there are too many negative observations, is appropriate against
alternatives under which the probability of a negative observation is larger
than 0.5, that is, alternatives with P(X j < 0) > 0.5. Under such alternatives,
the population distribution is more negative than under the null hypothesis
in the sense that its median is negative instead of O.
A lower-tailed test is appropriate against alternatives under which the
population distribution has a positive median. A two-tailed test is appropriate
when one is concerned with both alternatives.
Let F be the c.d.f. of the distribution from which the X_j are sampled. Since
we are assuming that the point 0 does not have positive probability, F(0) =
P(X_j ≤ 0) = P(X_j < 0), and the null hypothesis can be stated as H₀:
F(0) = 0.5. Notice that an alternative distribution with F(0) > 0.5 is more
negative, in the above sense, because if the probability of a negative observa-
tion exceeds 0.5, the population median must be negative. That is, loosely
speaking, the larger F is, the more negative the population. This is illustrated
in Fig. 3.1, for arbitrary c.d.f.'s F₁ and F₂, where F₂(0) > F₁(0) = 0.5 and the
medians are related by ξ₂ < ξ₁ = 0.
The power of the tests above is, of course, just the power of whichever
binomial test is used (lower-, upper-, or two-tailed) against the alternative
F(0) = p for some p ≠ 0.5.
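For readers who want such power values numerically, here is a minimal sketch (ours; it assumes SciPy is available, and the critical value and alternative used in the example are arbitrary) that evaluates the rejection probability of a sign test when S is binomial with parameter p.

from scipy.stats import binom

def sign_test_power(n, p, s_lower=None, s_upper=None):
    """P(reject) for a sign test rejecting when S <= s_lower and/or S >= s_upper,
    computed under S ~ binomial(n, p); omit whichever tail is not used."""
    power = 0.0
    if s_lower is not None:
        power += binom.cdf(s_lower, n, p)        # P(S <= s_lower)
    if s_upper is not None:
        power += binom.sf(s_upper - 1, n, p)     # P(S >= s_upper)
    return power

# Lower-tailed test with n = 18 rejecting for S <= 5:
print(sign_test_power(18, 0.5, s_lower=5))    # the exact level, about 0.048
print(sign_test_power(18, 0.3, s_lower=5))    # the power against F(0) = 0.3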
Of course, there is nothing special about the particular value 0 for the
median. To test the null hypothesis that the distribution of every observation
has median ξ₀, say, assuming that P(X_j = ξ₀) = 0 for all j, we would define
S as the number of observations which are smaller than ξ₀. Under this null

Figure 3.1 [Two c.d.f.'s F₁ and F₂ plotted against x, with F₂(0) > F₁(0) = 0.5.]

hypothesis S is again binomially distributed with parameter p = 0.5 and the


tests previously discussed apply in exactly the same way.
Similarly, there is nothing special about the median rather than some
other percent point. To test the null hypothesis that the distribution of every
observation has p₀-point ξ₀, assuming that P(X_j = ξ₀) = 0 for all j, we would
define S as the number of observations which are smaller than ξ₀. Under the
null hypothesis, P(X_j < ξ₀) = p₀, so that S is binomially distributed with
parameter p = p₀. Again the binomial tests may be employed.
These procedures are all frequently called "sign tests" because they use
only the signs of X_j − ξ₀ and not the precise values of the observations.
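As a computational aside (ours), the sign test for a hypothesized p₀-point ξ₀ reduces to an exact binomial test on S. The sketch below assumes SciPy is available and that no observation equals ξ₀ (ties are treated in Sect. 6); the function name and the simulated data are our own.

import numpy as np
from scipy.stats import binomtest

def sign_test(x, xi0, p0=0.5, alternative="two-sided"):
    """Exact sign test of the null hypothesis that xi0 is the p0-quantile.
    S counts observations below xi0; under the null, S ~ binomial(n, p0).
    Use alternative='greater' when a quantile below xi0 (large S) is suspected,
    'less' for the other one-sided alternative."""
    x = np.asarray(x)
    s = int(np.sum(x < xi0))
    return binomtest(s, len(x), p0, alternative=alternative)

# Hypothetical sample of 18 observations; test whether the median is 0:
rng = np.random.default_rng(0)
data = rng.normal(loc=0.3, scale=1.0, size=18)
print(sign_test(data, xi0=0.0).pvalue)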
While the independence of the observations is fundamental, the other
assumptions can be relaxed some without affecting the validity of the test.
If the observations are not identically distributed under the null hypothesis,
the level of the test is unchanged provided that each observation is drawn
from a distribution with p₀-point ξ₀. (For paired samples, the null hypothesis
of median 0 may be justified by random assignment. See Sect. 7.) An upper-
tailed (lower-tailed) test is appropriate against alternatives under which the
population distributions of the observations all have negative (positive)
medians. A two-tailed test is appropriate when one is concerned with both of
these alternatives; this means either that the population distributions of the
observations all have negative medians, or that they all have positive medians,
but not that some have negative medians and some positive. Against strongly
mixed alternatives of the last kind, one- and two-tailed sign tests may all be
ineffective and other procedures would be desirable.
The situation is more complicated if the probabilities p_j = P(X_j < ξ₀) are
allowed to differ under the null hypothesis. A "random effects" model under
which the p_j are independently sampled from a population with mean p
reduces immediately to the original situation provided p is the parameter of
interest. A simple result for the "fixed effects" model is that the one-tailed
tests are valid for either H₀: p_j ≤ p₀ for all j or H₀: p_j ≥ p₀ for all j, as appro-
priate. A much deeper result for the fixed effects model is that if one is inter-
ested in the parameter p = Σ_j p_j/n, then a one- or two-tailed sign test is valid
for the null hypothesis p = p₀ provided the level is small enough (up to
approximately 0.5) so that the "acceptance" region includes the integers
just above and below np₀ (np₀ itself if it is an integer). This "main effect" p
may well not be the parameter of interest, however. Indeed its definition even
depends on the sample size unless the experiment or sample is specifically
designed to avoid this. Unfortunately, for other null hypotheses the sign test
may be invalid or its validity not strictly proved.
The power against alternatives under which the observations are not
identically distributed can be calculated from the binomial distribution as
before if p_j is the same for all j, or if the random effects model mentioned above
holds. Under a fixed effects model with differing p_j, the power might be
approximated by treating S as binomial with parameter p = Σ_j p_j/n. This
overestimates the power whenever the "acceptance" region includes the
integers just above and below np, that is, whenever p ≥ (s_ℓ + 1)/n and/or
p ≤ (s_u − 1)/n, where s_ℓ and s_u are the lower and upper critical values respec-
tively. It underestimates the power of a one-tailed test for p ≤ s_ℓ/n or p ≥ s_u/n
as relevant (where the power exceeds 0.5, approximately). The power of a
two-tailed test can be underestimated in the same region by simply ignoring
the (ordinarily negligible) probability of rejection in the "wrong" tail and
using the one-tailed lower bound for the "correct" tail. These results for the
"fixed effects" model are due to Hoeffding [1956].

3.2 "Optimum" Properties

The one-tailed sign tests are, in a sense which will now be described, the best
possible. Certain of the two-tailed tests have similar but less strong proper-
ties, and all are admissible.
Assume once more that the observations X₁, ..., X_n are independent and
identically distributed and we wish to test the null hypothesis that the median
ξ of the population distribution is a particular number ξ₀. Assume also that
P(X_j = ξ₀) = 0 under both null and alternative hypotheses. Consider the
class of tests based on S, the number of observations smaller than ξ₀. Since
S is binomial, there is a level α test based on S which is uniformly most
powerful against one-sided alternatives, namely the appropriate one-tailed
binomial test at exact level α. This test is just the one-tailed sign test at exact
level α.
What if one considers not only tests based on S, but also tests which make
further use of the original observations? One might think that a better test
could be produced by taking into account how far above and below ξ₀ the
observations fall, rather than just how many fall above and how many below.
This is not possible, however, as long as one insists that the test have level α
for every population distribution with median ξ₀. Compared to any other
such test, the one-tailed sign test at exact level α which rejects when S is too
small has greater power than any other test against every alternative distri-
bution under which the median exceeds ξ₀. A similar statement holds for the
test rejecting in the other tail and alternatives on the other side. In other
words, a one-tailed sign test at exact level α is uniformly most powerful
against the appropriate one-sided alternative, among all level α tests based
on X₁, ..., X_n for the null hypothesis that the median is ξ₀.
For two-sided alternatives there is, of course, no uniformly most powerful
test of the null hypothesis that the median is ξ₀. Suppose, however, that we
consider only unbiased tests, that is, tests which reject with probability at
least as great under every alternative as under the null hypothesis. Among
these tests, the unbiased sign test (which is equal-tailed) is uniformly most
powerful.
The symmetry of the situation suggests another way of restricting the class
of tests to be considered with a two-sided alternative. Suppose we had

observed not X₁, ..., X_n but Y₁, ..., Y_n, where Y_j is the same distance from
ξ₀ as X_j but on the other side (that is, Y_j = 2ξ₀ − X_j). If the X_j satisfy the
null hypothesis, so do the Y_j; if they satisfy the alternative, so do the Y_j. Hence,
in the absence of other considerations, it seems equally reasonable to apply
a test to the Y's as to the X's, and it would be unpleasant if different outcomes
resulted. This suggests requiring that a test be symmetric in the sense that
applying it to the Y's always gives the same outcome as applying it to the
X's. Among such tests also, the equal-tailed sign test is uniformly most
powerful.
Every two-tailed sign test is admissible (in the two-conclusion interpreta-
tion of two-tailed tests); that is, any test having greater power at some
alternative has either smaller power at some other alternative or greater
probability of rejection under some distribution of the null hypothesis.
The restriction to identically distributed observations can be relaxed
without affecting any of the properties above. That is, the results hold for
alternatives under which X₁, ..., X_n are independent, with P(X_j < ξ₀) the
same for all j, but are not necessarily identically distributed, provided the null
hypothesis is similarly enlarged.
If "p₀-point" is substituted for the hypothesized median value ξ₀ through-
out, the foregoing properties continue to hold, except that an unbiased test
is not equal-tailed when p₀ ≠ 0.5, and the discussion of symmetry no longer
applies. In summary, the "optimum" properties of the sign tests are as follows.
If X₁, ..., X_n are independent and identically distributed with
P(X_j = ξ₀) = 0, then among tests of the null hypothesis P(X_j < ξ₀) = p₀:
(a) A one-tailed sign test is uniformly most powerful against the appropriate
one-sided alternative;
(b) Any two-tailed sign test is admissible;
(c) A two-tailed, unbiased sign test is uniformly most powerful against the
two-sided alternative P(X_j < ξ₀) ≠ p₀ among unbiased tests and, when
p₀ = 0.5, among symmetric tests.
If X₁, ..., X_n are not necessarily identically distributed but are indepen-
dent with P(X_j < ξ₀) = P(X_j ≤ ξ₀) = p for all j, then the same statements
apply to the null hypothesis p = p₀ and the alternative p ≠ p₀.
The proof of the foregoing statements, which will be given in Sect. 3.3,
depends on the fact that the null hypotheses are very broad and are satisfied
by some peculiar distributions, like the density represented by the dotted line
in Fig. 3.2(b). If one is willing to test a more restrictive null hypothesis, there
could well be tests which are more powerful, at least against some alterna-
tives. For instance, the hypothesis that the median is ξ₀ might be replaced by
the hypothesis that the distribution is symmetric around ξ₀. Nonparametric
tests of this null hypothesis will be discussed in Chap. 3 and 4.
For p₀ ≠ 0.5, restrictions of the null hypothesis that the p₀-point is ξ₀
have been studied only for parametric situations. For instance, the hypothesis
might be that the observations come from a normal population with p₀-point

Figure 3.2 [(a) c.d.f.'s of the type defined by F and G; (b) the corresponding densities f(x) and g(x) near ξ₀, the alternative density shown dotted.]

ξ₀. Procedures based on such assumptions are outside the scope of this book,
but it is relevant to mention that considerable risk accompanies their apparent
advantages. The risks and advantages are easily seen in terms of the estimators
involved. With no distribution assumption, the true population probability
below ξ₀, say p, would be estimated by S/n, where S is the number of observa-
tions below ξ₀ in the sample. With the assumption of a normal population,
the probability below ξ₀ is p′ = Φ[(ξ₀ − μ)/σ], where μ and σ are the
population mean and standard deviation and Φ is the standard normal c.d.f.
Intuition would lead one to use p̂ = Φ[(ξ₀ − X̄)/s] as an estimator for p′,
where X̄ and s are the sample mean and standard deviation. (This estimator
p̂ is slightly biased for p′ in normal populations, but can be adjusted to be
unbiased and in fact minimum variance unbiased [Ellison, 1964].) Under
normality, p̂ (or p̂ adjusted) is a much better estimator than S/n. However, a
departure from normality which looks minor can easily lead to an important
difference between p′ and the true proportion p and therefore very poor
properties for p̂. (Typical goodness-of-fit tests of the normality assumption
will throw almost no light on this crucial question.) In fact, such information
as the sample provides about p beyond the value of S/n is, in common sense,
relatively little and difficult to extract. The advantage of the estimator p̂ over
S/n relies most heavily on the assumed normal shape when ξ₀ is in the extreme
tails, and hence these reservations about p̂ apply most strongly when p is
close to 0 or 1, which is unfortunately just when the advantage of p̂ is also
greatest. (See also Sect. 3.1 of Chap. 8.)
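The following rough simulation sketch (entirely ours; the contamination model, the constants, and the sample size are arbitrary choices for illustration) contrasts the two estimators just discussed when the population is a slightly contaminated normal rather than exactly normal.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
xi0, n, reps = -2.0, 50, 5000
# Population: 95% N(0,1), 5% N(0, 5^2); true p = P(X < xi0):
p_true = 0.95 * norm.cdf(xi0) + 0.05 * norm.cdf(xi0 / 5.0)

err_sn, err_phi = [], []
for _ in range(reps):
    z = rng.normal(size=n)
    x = np.where(rng.random(n) < 0.05, 5.0 * z, z)
    err_sn.append(np.mean(x < xi0) - p_true)                            # S/n
    err_phi.append(norm.cdf((xi0 - x.mean()) / x.std(ddof=1)) - p_true) # normal-theory

print("RMSE of S/n:              ", np.sqrt(np.mean(np.square(err_sn))))
print("RMSE of normal-theory fit:", np.sqrt(np.mean(np.square(err_phi))))

With these particular settings the normal-theory estimate is systematically biased because the fitted normal misrepresents the tail at ξ₀, while S/n remains valid though more variable; changing the contamination or ξ₀ changes which effect dominates.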
Why do similar reservations not apply to nonparametric procedures
based on symmetry? Because symmetry may well be more plausible than
normality, and the effects of departures from the assumption are less serious.
In fact, nonparametric procedures based on symmetry are often used to make
inferences about the location of the" center" of the population distribution;
a departure from symmetry will require that this "center" be defined some-
how, and the definition implicit in the procedure used may be satisfactory.
In contrast, an inference about p is typically made in a situation where the
proportion of the population beyond some critical point is of interest (a


tolerance limit, for instance). Then it will not be satisfactory if the inference is
really about some other parameter, like p' above, which may differ appreci-
ably from p under mild departures from the assumptions.

*3.3 Proofs

We will demonstrate the optimum properties summarized above for the sign
tests only in the case where X₁, ..., X_n are independent and identically
distributed with P(X_j = ξ₀) = 0. We will prove that the one-tailed, and
unbiased two-tailed, sign tests of the null hypothesis P(X_j < ξ₀) = p₀ are,
respectively, uniformly most powerful, and uniformly most powerful un-
biased (and, when p₀ = 0.5, uniformly most powerful symmetric) against the
appropriate alternatives. The proofs of admissibility and the extension to
observations with possibly different distributions are requested in Problems
11 and 12.
Consider any alternative distribution G with P_G(X_j < ξ₀) =
P_G(X_j ≤ ξ₀) = p_G ≠ p₀. For any p, define F as a distribution with prob-
ability p to the left of ξ₀ and the same conditional distribution as G on each
side of ξ₀. (To simplify notation, the dependence of F on p will not be in-
dicated.) Specifically, let

F(x) = (p/p_G) G(x)  for x ≤ ξ₀,
F(x) = 1 − [(1 − p)/(1 − p_G)][1 − G(x)]  for x ≥ ξ₀.
Then P_F(X_j < ξ₀) = P_F(X_j ≤ ξ₀) = p. Also, the conditional distribution of
X_j, given that X_j < ξ₀, is the same under F as G, as is the conditional
distribution given that X_j > ξ₀. The family of distributions F includes G
(when p = p_G) and a null distribution F₀ (when p = p₀). The point of the
definition is that F₀ is the null distribution "most like" the given alternative
G, or "least favorable" for testing against G, as we shall see.
Figure 3.2(a) illustrates possible c.d.f.'s of the type defined by F and G;
notice the abrupt change of slope in F at ξ₀. The corresponding densities f
and g are shown in Fig. 3.2(b), where f and g are related by

f(x) = (p/p_G) g(x)  for x < ξ₀,
f(x) = [(1 − p)/(1 − p_G)] g(x)  for x > ξ₀.

Thus f is the same as g on each side of ξ₀ except for a scale factor, but f has ξ₀
as a quantile of order p.

If X₁, ..., X_n are independent and identically distributed according to
one of the distributions F, then S, the number of observations below ξ₀, is a
sufficient statistic (Problem 13) and is binomially distributed with param-
eters n and p. It follows that the appropriate one-tailed, level α sign test is
most powerful against G among level α tests of the null hypothesis F₀. It is
therefore most powerful against G among level α tests of the original null
hypothesis that P(X_j < ξ₀) = p₀ because (a) it has level α for the original
null hypothesis, while (b) any level α test of the original null hypothesis is
also a level α test of F₀ and hence cannot be more powerful against G. Since G
was arbitrary, a one-tailed sign test is uniformly most powerful against
alternatives on the appropriate side.
Similarly, the unbiased, level α, two-tailed sign test is most powerful
against G among level α tests of the null hypothesis F₀ which are unbiased
against the alternatives F for which p ≠ p₀. It is, therefore (Problem 14),
most powerful against G among level α, unbiased tests for the original
problem, and hence, uniformly most powerful.
When p₀ = 0.5, the symmetric, level α, two-tailed sign test is most power-
ful against G among symmetric, level α tests of the null hypothesis F₀. It is,
therefore (Problem 15), uniformly most powerful among level α, symmetric
tests for the original problem.
The above proof contains, in effect, a proof of the following theorem,
which will be useful later in this book. The formal proof is requested in
Problem 16.

Theorem 3.1. Let H₀ be a null hypothesis and suppose that H₀′ is contained in
H₀.
(a) A test is most powerful against an alternative G among level α tests of H₀
if it has level α for H₀ and is most powerful against G among level α tests of
H₀′.
(b) A test is most powerful against G among level α tests of H₀ which are
unbiased against H₁ if it has level α for H₀, is unbiased against H₁, and is
most powerful against G among level α tests of H₀′ which are unbiased
against H₁′, where H₁′ is contained in H₁.
(c) A test is most powerful against G among level α symmetric tests of H₀ if it is
symmetric, has level α for H₀, and is most powerful against G among
symmetric level α tests of H₀′.
(d) The property in (c) holds if the requirement of symmetry is replaced by any
requirement that does not depend on H₀.*

4 Confidence Procedures Based on the Sign Test


Assume that X₁, ..., X_n are independent observations, identically and
continuously distributed. Then, with probability one, no two observations
are equal. We have discussed the sign test procedure for hypotheses of the

form P(X_j < ξ) = p. We shall now derive two types of confidence procedures
which correspond to these tests. First, confidence intervals are derived for
the true value of p when ξ is fixed, and second, confidence intervals are
derived for the true value of ξ when p is fixed, i.e., for the quantile ξ_p. The first
requires only brief mention. The second involves the concept of "order
statistics," which merit some discussion in their own right.
The confidence bounds for p = P(X_j < ξ) for a fixed ξ which correspond
to the sign test are obtained by direct application of the standard binomial
confidence procedures (Sect. 6 of Chap. 1) to S, the number of observations
below ξ, since the test is based on S and since S is binomial with parameters
n and p. Many properties of these confidence bounds follow immediately
from the previous section and the results of Chap. 1.
The confidence bounds for a quantile ξ_p, such as the median, which
correspond to the sign test turn out to be certain order statistics of the sample,
which are defined as follows. For a set of observations X₁, ..., X_n which are
all different, let X(1) denote the smallest of the set, X(2) the second smallest,
etc., and X(n) the largest. Then, since we assumed that there are no ties, we
have X(1) < X(2) < ... < X(n), and these are the original observations after
arrangement in increasing order of magnitude. They are collectively called
the order statistics of the sample, and X(r), for 1 ≤ r ≤ n, is called the rth
order statistic. Note that the sign of an observation is considered in deter-
mining its size; for instance, −3, −2, 0, 1 are arranged in order of size. (If
there are ties, we can only require that X(1) ≤ X(2) ≤ ... ≤ X(n).)
The property of order statistics which is of immediate relevance is that,
for any ξ, we have X(r) < ξ if and only if at least r of the n observations are
less than ξ. Consider now a one-tailed sign test at level α with critical value
s_ℓ in the lower tail of S. This test "accepts" the null hypothesis ξ_p = ξ if and
only if at least s_ℓ + 1 of the observations are smaller than ξ. This is equivalent
to having X(s_ℓ + 1) < ξ, by the property just stated. Therefore X(s_ℓ + 1) is the
level 1 − α lower confidence bound for ξ_p corresponding to the sign test.
For example, a lower-tailed binomial test of H₀: p = 0.5 with n = 18
observations rejects at level 0.05 if there are 5 or fewer successes (observations
smaller than the hypothesized median). Therefore X(6), the sixth smallest
observation, is a lower 95% confidence bound for the population median
for any set of 18 observations. Since from Table B the exact level of this test
is 0.0481, the exact confidence level is 0.9519.
An upper confidence bound is found similarly. If s_u is the critical value
for an upper-tailed, level α binomial test with parameters n and p, then the
(s_u)th smallest observation is the level 1 − α upper confidence bound for ξ_p,
since the sign test rejects ξ_p = ξ if and only if at least s_u observations are
smaller than ξ. For instance, 13 is the critical value for a level 0.05, upper-tailed
test of p = 0.5 with n = 18. Therefore X(13), the 13th smallest observation,
is a 95% upper confidence bound for the median, with exact confidence
coefficient again 0.9519.
The interval between the lower confidence bound at level 1 − α₁ and the
upper confidence bound at level 1 − α₂ is, of course, a confidence interval
for the quantile ξ_p at level 1 − α₁ − α₂. This corresponds to a two-tailed
sign test with lower and upper tail probabilities α₁ and α₂.
For p = 0.5, an equal-tailed test will have s_u = n − s_ℓ, by symmetry. The
corresponding confidence interval for the median is then the interval between
the (s_ℓ + 1)th and the (n − s_ℓ)th smallest observations. This interval has the
symmetry one would expect, inasmuch as the lower endpoint is the (s_ℓ + 1)th
smallest observation and the upper endpoint is the (s_ℓ + 1)th from the largest.
(The jth smallest observation is the (n + 1 − j)th from the largest, and
n + 1 − j = n + 1 − s_u = s_ℓ + 1 here.) For instance, with 18 observations,
the interval between the 6th and 13th smallest observations is a 90% con-
fidence interval for the median, with exact level 1 − 2(0.0481) = 0.9038.
There is no such symmetry for p ≠ 0.5, that is, for quantiles other than the
median.
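A quick numerical check of this n = 18 example (ours; it assumes SciPy and uses a simulated sample purely for illustration):

import numpy as np
from scipy.stats import binom

n = 18
a = binom.cdf(5, n, 0.5)         # P(S <= 5) = 0.0481, the exact one-tailed level
print(a, 1 - a, 1 - 2 * a)       # 0.0481, 0.9519 and 0.9038 as quoted above

x = np.sort(np.random.default_rng(2).normal(size=n))   # hypothetical sample
print(x[5], x[12])               # X(6) and X(13): the ~90% confidence interval for the median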
If the quantile is not unique, there is an interval of values for the quantile
as explained earlier in Sect. 2. Then the confidence procedures just described
are equally valid for any determination of the quantile. In fact, the prob-
ability is at least 1 − α that the lower 1 − α confidence bound will fall at or
below the lower end of the interval, and similarly that the upper bound will
be at or above the upper end (Problem 19).
Whether or not the confidence interval is taken to include its endpoints
makes no practical difference in any ordinary application, and it makes no
theoretical difference as long as the population distribution is continuous,
that is, has no points of positive probability. The error rate for a discon-
tinuous population will be at most that for a continuous one if the endpoints
are included in the confidence interval, and at least that for a continuous one
if they are excluded (Problem 20, or, more generally, see Sect. 6 of Chap. 3).
If the observations X₁, ..., X_n are independent but are not identically
distributed, the procedures above remain valid provided that the quantile
ξ_p is identical for all populations from which the observations are drawn. If
the quantiles are not unique, the confidence intervals are valid for every ξ_p
which is a quantile of order p for all the populations.
The properties of these confidence procedures correspond to those of the
tests from which they were obtained, which were developed in Sect. 3. Here
we shall say only that one can hope to do better only by making further
assumptions. Alternative confidence procedures for the median based on the
assumption of a symmetric population are discussed in Chap. 3, especially
Sect. 4. For other quantiles, the only available alternative procedures are
based on parametric assumptions and are sensitive to deviations from these
assumptions, as discussed earlier.
We have developed the procedures for finding confidence bounds and
confidence intervals for quantiles as those corresponding to the sign test.
Since the end points are order statistics of the sample observations, these
confidence regions can also be developed independently of the sign test using
the principles and properties of order statistics. Thus we will now give a brief
introduction to some properties of order statistics and show that the confi-
dence regions developed above can be obtained directly from them.

Order statistics are particularly useful in nonparametric statistics, partly
because of the properties of an important device called the probability integral
transformation. If X has a continuous c.d.f. F, then the transformed random
variable U = F(X) has a uniform distribution on the interval (0, 1). Further-
more, if X(r) for 1 ≤ r ≤ n are the order statistics of a sample of n from any
distribution with continuous c.d.f., then the transformed random variables
U(r) = F(X(r)) for 1 ≤ r ≤ n are the order statistics of a sample of n from the
uniform distribution on the interval (0, 1), and therefore the U(r) are distri-
bution-free. These and other properties of order statistics are Problems
21, 29-36.
Consider the rth order statistic for a sample of n from any distribution F.
Then since X(r) ≤ x if and only if at least r of the n observations are less than
or equal to x, the c.d.f. of X(r) is the upper tail of the binomial distribution
with parameters n and F(x), that is

P(X(r) ≤ x) = Σ_{k=r}^n C(n,k) [F(x)]^k [1 − F(x)]^{n−k}.    (4.1)

Now suppose that F is continuous and x is its quantile of order p, so that
F(ξ_p) = p. Then (4.1) can be written as

P(X(r) ≤ ξ_p) = P(X(r) < ξ_p) = Σ_{k=r}^n C(n,k) p^k (1 − p)^{n−k}.    (4.2)

Thus a lower confidence bound for ξ_p at any desired (attainable) level 1 − α
can be found by setting the right-hand side of (4.2) equal to 1 − α, that is, by
determining that value of r which has upper tail probability 1 − α in the
binomial distribution with parameters n and p. The lower bound at conserva-
tive level 1 − α is X(r) for the largest integer r which makes the right-hand
side of (4.2) at least 1 − α. An upper confidence bound for ξ_p is found similarly.
Further, we have (Problem 34)

P(X(r) < ξ_p < X(v)) = P(X(r) < ξ_p) − P(X(v) ≤ ξ_p)
                     = Σ_{k=r}^{v−1} C(n,k) p^k (1 − p)^{n−k}.    (4.3)
Thus we obtain a confidence interval for ξ_p at any desired confidence level
1 − α by suitable choice of r and v, using the binomial distribution.
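A sketch (ours) of this choice of r and v for an equal-tailed interval; it assumes SciPy, assumes the level is attainable in each tail, and the function name is arbitrary.

from scipy.stats import binom

def quantile_ci_indices(n, p, alpha):
    """1-based order-statistic indices (r, v) with
    P(X(r) < xi_p < X(v)) >= 1 - alpha, putting at most alpha/2 in each tail."""
    k = 0
    while binom.cdf(k, n, p) <= alpha / 2:   # advance past every k with P(S <= k) <= alpha/2
        k += 1
    r = k                                    # now P(S <= r - 1) <= alpha/2 < P(S <= r)
    v = n
    while binom.sf(v - 1, n, p) <= alpha / 2:   # move v down while P(S >= v) <= alpha/2
        v -= 1
    v += 1                                   # smallest v with P(S >= v) <= alpha/2
    coverage = binom.cdf(v - 1, n, p) - binom.cdf(r - 1, n, p)   # the sum in (4.3)
    return r, v, coverage

print(quantile_ci_indices(18, 0.5, 0.10))    # (6, 13, 0.9038...), the median example above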
The result in (4.2) can also be written in the form of the beta distribution
(Problem 30) as

P(X(r) < ξ_p) = ∫₀^p [B(r, n − r + 1)]⁻¹ u^{r−1} (1 − u)^{n−r} du    (4.4)

where

B(r, v) = (r − 1)! (v − 1)! / (r + v − 1)!.    (4.5)

As a final comment, we note that the event X(r) < ξ_p is equivalent to
F(X(r)) < p. Applying the probability integral transformation, the latter
inequality can be replaced by U(r) < p, where U(r) is the rth order statistic
from a uniform distribution on (0, 1). Since the density of U(r) is the integrand
of (4.4) (Problem 30), this observation provides a direct verification of the
expression given in (4.4).
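The binomial-beta identity behind (4.4) can also be checked numerically; a two-line verification (ours, assuming SciPy):

from scipy.stats import binom, beta

n, r, p = 18, 6, 0.5
print(binom.sf(r - 1, n, p))       # the binomial upper tail in (4.2): 0.9519...
print(beta.cdf(p, r, n - r + 1))   # the beta(r, n - r + 1) integral in (4.4): same value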

5 Interpolation between Attainable Levels


When the possible values of α are discrete, as above, there may be no (non-
randomized) confidence bound at exactly the level desired. One might then
be content with a nearby level which is attainable. If not, a bound at approxi-
mately the desired level may be obtained by interpolating linearly between
the two neighboring attainable levels. That is, if W₁ and W₂ are two lower
(or two upper) confidence bounds at levels 1 − α₁ and 1 − α₂ respectively,
with α₁ < α < α₂, then a lower (or upper) confidence bound with level
approximately 1 − α (and certainly between 1 − α₁ and 1 − α₂) is

[(α₂ − α)W₁ + (α − α₁)W₂] / (α₂ − α₁).    (5.1)

In the example of the previous section where n = 18, a lower confidence
bound was desired for the median at level 0.95. The relevant lower-tailed
rejection regions of S for p = 0.5 are S ≤ 5 at level 0.0481 and S ≤ 6 at
level 0.1189, and the corresponding lower confidence bounds on ξ_p are X(6)
and X(7), with levels 0.9519 and 0.8811 respectively. Using (5.1), a lower
confidence bound on ξ_p at level approximately 0.95 is

[0.0689 X(6) + 0.0019 X(7)] / 0.0708.
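In code, (5.1) is a one-line weighted average; a sketch (ours, with a simulated sample standing in for real data):

import numpy as np

def interpolated_bound(w1, w2, a1, a2, alpha):
    """Linear interpolation (5.1) between bounds w1, w2 at exact levels 1 - a1, 1 - a2."""
    return ((a2 - alpha) * w1 + (alpha - a1) * w2) / (a2 - a1)

x = np.sort(np.random.default_rng(3).normal(size=18))   # hypothetical sample of 18
# X(6) and X(7) are lower bounds for the median at exact levels 0.9519 and 0.8811:
print(interpolated_bound(x[5], x[6], 0.0481, 0.1189, alpha=0.05))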
The approximation in (5.1) is based on simple linear interpolation and
could be used with almost any confidence procedure. Since an hypothesis
test can be performed by comparing the hypothesized value to the confidence
interval endpoints, a confidence procedure at level approximately 1 − α
yields directly a test at level approximately α.
The accuracy of this approximation can be given algebraically in the
following, interesting case. Suppose that for any n, two neighboring order
statistics X(i) and X(i+1) constitute two lower confidence bounds for some
quantile ξ at levels say 1 − α₁ and 1 − α₂ respectively. If (X(i) + X(i+1))/2
is used for the confidence bound on ξ, then the level is approximately
1 − (α₁ + α₂)/2 according to linear interpolation as in (5.1). However, when
ξ is the median and the population is symmetric about ξ, the methodology
explained in Sect. 7.3 of Chap. 3 will show that (X(i) + X(i+1))/2 is a lower
confidence bound at the true level 1 − α, where α = (1 − i/n)α₁ + (i/n)α₂

(see Problem 49, Chap. 3). Thus the weights for the true α are (1 − i/n, i/n),
while the approximation above uses (1/2, 1/2). For α fixed, i/n → 1/2 as
n → ∞, but at typical levels and in samples small enough to make discreteness
severe and interpolation really interesting, i/n is not close to 1/2 and the
approximation in (5.1) is disappointingly inaccurate. For instance, for α near
0.05 the exact values of i/n are 0.28 when n = 9 and 0.32 when n = 17
(Problem 39). These two sample sizes were chosen since the attainable levels
for testing p = 0.5 are approximately equidistant from 0.05 (see Table B).

6 The Sign Test with Zero Differences

6.1 Discussion of Procedures

Suppose that X₁, ..., X_n are independent and identically distributed
observations and we wish to test the null hypothesis that the distribution has
p₀-point ξ₀, i.e., p₀th quantile ξ₀, as in Sect. 3. If one or more of the observa-
tions is exactly equal to ξ₀, the test has not yet been defined. Such observa-
tions are called "zero differences" or sometimes "ties." They can arise
theoretically only if the population distribution has positive probability at
ξ₀, and the appropriate procedure depends on how we wish to treat such
populations. In practice, they can always arise because of limitations on
precision in measurement.
If a test procedure is of interest simply as a method of obtaining a confi-
dence interval for the p-point, the problem of zero differences generally need
not be resolved because it affects only whether or not the endpoints are
included in the interval (see the discussion in Sect. 4).
However, if one really wants to perform a test for its own sake, or to obtain
confidence intervals for a population probability p for fixed ξ, then any of the
following parameters may be of interest:
(a) the true probability at or below ξ₀, that is, p_≤ = P(X_j ≤ ξ₀);
(b) the true probability below ξ₀, that is, p_< = P(X_j < ξ₀);
(c) the p₀-point ξ_p₀ as defined in Sect. 2;
(d) the difference between the probability below and the probability above
ξ₀, that is, P(X_j > ξ₀) − P(X_j < ξ₀) = p_> − p_<.

The following facts are relevant to these respective cases.

(a) The number S_≤ of observations at or below ξ₀ has a binomial distri-
bution with parameters n and p_≤. Tests of hypotheses and confidence
procedures for p_≤ can thus be based on S_≤.
(b) The number S_< of observations below ξ₀ has a binomial distribution
with parameters n and p_<. Tests of hypotheses and confidence procedures
for p_< can thus be based on S_<.

(c) Since ξ₀ is a p₀-point if and only if p_< ≤ p₀ ≤ p_≤, by definition (2.1),
tests concerning the p₀-point ξ₀ can be obtained from (a) and (b). Specifically,
the null hypothesis p_< ≤ p₀ is equivalent to ξ_p₀ ≥ ξ₀ and the alternative
p_< > p₀ is equivalent to ξ_p₀ < ξ₀ (Problem 40. In each case, the largest ξ_p₀
is to be used if ξ_p₀ is not unique.) Similarly, the null hypothesis p_≤ ≥ p₀ is
equivalent to ξ_p₀ ≤ ξ₀ and the alternative p_≤ < p₀ to ξ_p₀ > ξ₀ (Problem 40.
Here the smallest ξ_p₀ is to be used if ξ_p₀ is not unique.) The relevant one-tailed
tests reject these null hypotheses for S_< ≥ s_u and S_≤ ≤ s_ℓ respectively, where
s_ℓ and s_u are the lower and upper critical values of binomial tests of the null
hypothesis p = p₀. A test of the null hypothesis that ξ₀ is a p₀-point against
two-sided alternatives therefore rejects if either S_≤ ≤ s_ℓ or S_< ≥ s_u. Note
that the precise definition of p₀-point is (unpleasantly) important here.
(d) Frequently one is concerned with whether the proportion of the
population below ξ₀ is smaller than, equal to, or larger than the proportion
above ξ₀, that is, with the relation between p_< = P(X_j < ξ₀) and p_> =
P(X_j > ξ₀). Then those observations equal to ξ₀ seem irrelevant to the
desired inference. This suggests that the inference be based on only those
observations which do not equal ξ₀. If there are N observations different
from ξ₀ and S_< is the number smaller, then the distribution of S_<, given the
value of N, is binomial with parameters N and p, where p is defined by

p = p_< / (p_< + p_>).    (6.1)

It is obvious that p = 0.5 when p_< = p_>, and that p is larger or smaller than
0.5 according as p_< is larger or smaller than p_>. A test of the null hypothesis
p_< = p_> (or equivalently that p = 0.5), against either one- or two-sided
alternatives, can then be based on S_<. This amounts to omitting from the
sample those observations which equal ξ₀ and applying to the remaining
observations a test of Sect. 3.1 for the null hypothesis that ξ₀ is the median,
using the reduced sample size. Any test of this type will be called a conditional
sign test, because it is conditional on the value of N.
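A minimal sketch (ours, assuming SciPy; the data are hypothetical paired differences) of the conditional sign test just described:

import numpy as np
from scipy.stats import binomtest

def conditional_sign_test(x, xi0=0.0, alternative="two-sided"):
    """Discard observations equal to xi0 and apply an exact binomial test to the rest;
    conditionally on N, S_< is binomial(N, 1/2) under H0: p_< = p_>."""
    x = np.asarray(x)
    kept = x[x != xi0]
    s_less = int(np.sum(kept < xi0))
    n_eff = len(kept)                  # N, the effective sample size to be reported
    return n_eff, binomtest(s_less, n_eff, 0.5, alternative=alternative).pvalue

diffs = np.array([1.2, -0.4, 0.0, 2.1, 0.0, 0.7, -1.5, 0.3, 0.9, 0.0])
print(conditional_sign_test(diffs))    # (N, conditional two-sided P-value)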
The parameter p may itself be of interest, as may the quantity

2p − 1 = (p_< − p_>) / (p_< + p_>).    (6.2)

Confidence bounds and other inferences about these parameters can be
obtained from the conditional distribution of S_< given N, which, as already
mentioned, is binomial with parameters N and p.
However, for a test of the null hypothesis that p_< − p_> is some value other
than 0, the parameter p in (6.1) is not determined unless p_< + p_> is a con-
stant. Therefore, exact methods of testing this hypothesis and hence of setting
confidence bounds for p_< − p_> are not easily developed. With large samples,

approximate methods can be used, based on the fact that S < - S> is ap-
proximately normally distributed with mean n(p< - p» and variance
n[p< + p> - (p< - p»2], provided that this variance is not too small
(Problem 42). The estimated variance S< + S> - [(S< - S»2jn] can be
used in place of the unknown true variance. It may be appropriate to in-
corporate a correction for continuity in the amount of! to S < - S>, although
the appropriate correction would be 1 rather than 1- in the case p< + p> = 1,
when S < + S> = n with certainty (Problem 42).
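A sketch (ours; the function name and toy data are arbitrary, a real application would need a much larger sample) of this large-sample approximation, using the estimated variance and a continuity correction of ½:

import numpy as np
from scipy.stats import norm

def approx_test_difference(x, xi0=0.0, delta0=0.0):
    """Approximate two-sided test of H0: p_< - p_> = delta0 based on S_< - S_>.
    Valid only when the estimated variance is not too small."""
    x = np.asarray(x)
    n = len(x)
    s_less, s_greater = int(np.sum(x < xi0)), int(np.sum(x > xi0))
    d = s_less - s_greater
    var_hat = s_less + s_greater - d ** 2 / n   # estimates n[p_< + p_> - (p_< - p_>)^2]
    z = (abs(d - n * delta0) - 0.5) / np.sqrt(var_hat)
    return z, 2 * norm.sf(z)

diffs = np.array([1.2, -0.4, 0.0, 2.1, 0.0, 0.7, -1.5, 0.3, 0.9, 0.0])
print(approx_test_difference(diffs, delta0=0.2))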

6.2 Conditional Properties of Conditional Sign Tests

Since the procedures and properties of tests concerning cases (a), (b) and (c)
of the previous subsection have already been discussed in Sect. 3, the remain-
der of this section will be devoted to those tests appropriate for case (d),
specifically, the conditional sign tests just described. We first consider a direct
argument in favor of performing a test for the null hypothesis H₀: p_< = p_>
conditionally on N, the number of observations which are not equal to ξ₀.
The argument is as follows. The number N does not pertain to the matter
under test but is, in effect, the sample size, because the n - N observations
which equal ξ₀ (the "ties") are irrelevant. Accordingly, whatever test is
performed, the effective sample size N and the properties of the test for that
given N should be reported. Failing to do so would be tantamount to using
a procedure involving a sample size which is random but reporting its
overall properties for any size sample instead of its properties for the sample
size actually used. For example, suppose a sample is taken, of size n₁ or n₂,
where the probability of each sample size is ½. Suppose further that a test at
level 0.01 is made when the sample size is n₁ and a test at level 0.09 when it is
n₂. The overall level of this procedure is 0.05, but it would be misleading to
report simply that a test had been made at level 0.05, withholding the in-
formation about whether in a particular instance the level was really 0.01
or 0.09. This is not an argument against varying the level in this way, but only
an argument for quoting the level and sample size actually used.
Such an argument for a conditional procedure can be very compelling.
On the other hand, there are situations where the conditional argument leads
to chaos at best. It is not always possible to condition on everything one
might like. Worse yet, the conditional argument, together with some ap-
parently harmless assumptions, leads to the radical conclusion that tail
probabilities are irrelevant and inferences should be based only on the
probability of the actual sample under the various hypotheses (the likelihood)
[Birnbaum, 1962]. Thus conditioning poses fundamental problems for
orthodox inference methods; these problems have no satisfactory resolution
entirely within the frequency interpretation of probability. Even the radical
conclusion is in accord with Bayesian and likelihood philosophies, however.

Accepting the argument above for conditional procedures, at least as


applied to N in the situation currently under discussion, we are led to con-
sider the conditional properties, for given N, of tests in this situation. These
properties all relate to the conditional distribution of X 1> ••• , X n' given N.
This can be derived for any null or alternative distribution and hence the
conditional probability, given N, of rejecting the null hypothesis can be
computed for any test. The conditional level of the test is defined as the
maximum of this conditional probability over null distributions of
Xl' ... ' X n. Similarly, the conditional power against an alternative distribu-
tion is the conditional probability under that alternative, given N, of rejecting
the null hypothesis. A test is called conditionally unbiased against an alterna-
tive hypothesis if its conditional power against each alternative distribution
is at least equal to its conditional level. Some conditional properties imply
corresponding unconditional properties (Problem 44).
Conditionally on N, the present situation is no different from that dis-
cussed in earlier sections. Therefore, a one-tailed, level α, conditional sign test,
that is, a one-tailed level α sign test, applied to the N observations different
from ξ₀, has uniformly greatest conditional power against the appropriate
one-sided alternative among tests at conditional level α. Further, the equal-
tailed, level α, conditional sign test, that is, the equal-tailed level α sign test
applied to the N observations different from ξ₀, has uniformly greatest
conditional power against any alternative with p_< ≠ p_>, among tests at
conditional level α which are conditionally unbiased against the alternative
hypothesis p_< ≠ p_>, or symmetric in the sense defined in Sect. 3.2 (or both).
In summary, when we assume that ties may occur theoretically, that is,
when P(X_j = ξ₀) > 0, the conditional sign tests have conditional properties
corresponding to the properties of the ordinary sign tests for the situation
where we assume P(X_j = ξ₀) = 0. The correspondence carries over to other
properties which are not specifically discussed above, for instance to the
situation when X₁, ..., X_n are not identically distributed. The point is that
if one accepts the argument for conditioning on N, then all of the reasons for
using sign tests when ties have probability 0 are equally valid for using
conditional sign tests when ties have positive probability. Of course, the
remarks in Sect. 3.2 continue to apply, and improvement is possible even
when ties have positive probability if one is willing to make an additional
assumption, such as symmetry.
The next subsection presents another argument leading to conditional
sign tests by invoking (unconditional) unbiasedness. This unbiasedness
argument leads to conditional tests at the same conditional level for each N,
however, which the direct argument for conditional procedures does not
require. In the absence of a convincing and practical method for choosing
significance levels, this difference loses importance. As regards P-values, if a
one-tailed P-value is to be reported, either argument suggests finding the
conditional P-value, that is, the conditional probability given N, under the
null hypothesis, of an outcome as extreme as or more extreme than that
obtained.

6.3 Unconditional Properties of Conditional Sign Tests

In the situation under discussion, suppose we do not insist upon a conditional


test. If we seek instead one which is unconditionally unbiased, we will again be
led to a conditional sign test, with the slight difference just mentioned. In
fact, the following two statements are true.
(a) A one-tailed, level α, conditional sign test is, from an unconditional
point of view (that is, unconditionally), uniformly most powerful against the
appropriate one-sided alternative, p_< < p_> or p_< > p_>, among level α tests
which are unbiased against this alternative.
(b) An equal-tailed, level α, conditional sign test is uniformly most
powerful against the two-sided alternative p_< ≠ p_>, among level α tests
which are unbiased against this alternative.

*6.4 Proof for One-Sided Alternatives

Consider the one-sided alternative p_< < p_>, that is, P(X_j < ξ₀) <
P(X_j > ξ₀). Let G be any distribution satisfying this alternative hypothesis.
In order to adapt the method of Sect. 3.3 to the present situation, for any p_<
and p_>, we define a distribution F as follows: P_F(X_j < ξ₀) = p_<; the condi-
tional distribution of X_j, given that X_j < ξ₀, is the same under F as under G;
P_F(X_j = ξ₀) = 1 − p_< − p_>; P_F(X_j > ξ₀) = p_>; and the conditional dis-
tribution of X_j, given that X_j > ξ₀, is the same under F as under G (Problem
43a). Then the family ℱ of distributions F includes G and it includes a null
distribution for each value of p_< = p_> (Problem 43b).
For notational convenience we let S = S_<, the number of observations
below ξ₀. Then for this family of distributions, the statistics S and N are
jointly sufficient (Problem 43c). We shall show that any unbiased test based
on S and N is conditional. We already know that, among tests at conditional
level α, a one-tailed conditional sign test at level α has uniformly greatest
conditional power against the appropriate one-sided alternative. The
desired conclusion, (a) of Sect. 6.3, then follows (Problem 44c).
It remains to show that any unbiased test φ(S, N) is conditional. We note
first that

E_F[φ(S, N)] ≤ α    (6.3)

for all null distributions F, while

E_K[φ(S, N)] ≥ α    (6.4)

for all alternative distributions K, by unbiasedness. Second, every null
distribution F is a limit of alternative distributions (e.g., the alternatives
K = [(m − 1)F + G]/m, or alternatives in the family ℱ for which p_< and
p_> approach their values under F). It follows that

E_F[φ(S, N)] = α    (6.5)

for all null distributions F.

Consider now the null distributions of the family ℱ above. For this
subfamily, N is a sufficient statistic (Problem 43d), and hence the conditional
probability of rejection given N is a function of N alone, say α(N). That is,
for p_< = p_> = r/2 say,

E_r[φ(S, N) | N] = α(N)    (6.6)

where α(N) does not depend on r. Since (6.5) holds for all null distributions,
taking the expected value of both sides of (6.6) gives

E_r[α(N)] = α  for all r.    (6.7)

Now N, the number of observations not equal to ξ₀, is binomially distributed
with parameters n and r = p_< + p_>. Since the family of binomial distribu-
tions is complete (Chap. 1, Sect. 3.4), it follows that α(N) = α for all N, that is,
the test is conditional. This is all that remained to be proved. □

Remarks

The type of argument employed in the previous two paragraphs often applies.
It is summarized in the following theorem, whose proof is requested in
Problem 45.

Theorem 6.1. Any unbiased test at level α has probability exactly α of rejection on the common boundary K of the null hypothesis H0 and the alternative H1. If T is a complete sufficient statistic for K, then any unbiased test at level α has conditional level exactly α for all distributions of K, conditional on T. Hence if T is a complete sufficient statistic for H0, and if H0 ⊂ K, that is, H0 is contained in the boundary of H1, then any unbiased test at level α is a conditional test, conditional on T.

"Complete" could be replaced by "boundedly complete" throughout,


meaning that there exists no bounded, nontrivial, unbiased estimator of 0
(cf. Chap. 1, Sect. 3.4). The "common boundary" means those distributions
which are limits of both null distributions and alternative distributions. The
relevant definition of limit here is that Fn --+ F if EFJ¢(X)J --+ EF[¢(X)] for
all bounded functions ¢, though stronger definitions typically hold also.*

*6.5 Proof for Two-Sided Alternatives

A level α test of the null hypothesis p< = p> which is unbiased against p< ≠ p> must have conditional level α given N, as follows from either corresponding one-sided statement, but it need not be conditionally unbiased (Problem 46). Accordingly, in contrast to the one-sided case (Sect. 6.4), the fact that the equal-tailed, level α, conditional sign test is uniformly most powerful among conditionally unbiased tests does not imply directly that it is uniformly most powerful among unconditionally unbiased tests. To prove that it is, let G be any alternative and consider its family ℱ of distributions F as defined in Sect. 6.4. We will prove that, among tests having level α for the null distributions of the family ℱ and unbiased against the alternatives of the family ℱ, the equal-tailed, level α, conditional sign test is uniformly most powerful against these alternatives, and, in particular, is most powerful against G. Since it is in fact a level α, unbiased test for the original, more inclusive, null and alternative hypotheses, it is, among such tests also, most powerful against G (Theorem 3.1) and thus uniformly most powerful, since G was arbitrary.
Now restrict the problem to the family ℱ. For this family, a sufficient statistic is (S, N). The distribution of (S, N) may be described as follows. N is binomial with parameters n and r = p< + p>, while given N, S is binomial with parameters N and p = p</r, as at (6.1).
We seek a test φ(S, N) of the null hypothesis p = 0.5 (that is, p< = p>), which is unbiased against the alternative p ≠ 0.5 (that is, p< ≠ p>). Let
α(r, p) = E_{r,p}[φ(S, N)]      (6.8)
be the power (the level, when p = 0.5) of the test φ. Let
α(p | N) = E_{r,p}[φ(S, N) | N]      (6.9)
be the conditional power (level, when p = 0.5) of φ given N, which is a function of p and N alone, not depending on r, because the conditional distribution of S given N is a function of p and N alone. If φ is unbiased at level α, then
α(r, 0.5) ≤ α,
α(r, p) ≥ α  for p ≠ 0.5.      (6.10)
It follows, as we saw in the one-sided case, that α(r, 0.5) = α and that the conditional level α(0.5 | N) = α. It also follows (Problem 47) that
∂α(r, p)/∂p = 0  at p = 0.5.      (6.11)

Now (Problem 47) it is also true that
α(r, p) = E_{r,p}[α(p | N)].      (6.12)
Since the distribution of N depends on r alone, we may differentiate with respect to p under the expectation (Problem 47), obtaining
∂α(r, p)/∂p = E_{r,p}[∂α(p | N)/∂p]      (6.13)
and then, by (6.11),
E_{r,p}[∂α(p | N)/∂p] = 0  at p = 0.5.      (6.14)

Since the family of distributions of N is complete, it follows that
∂α(p | N)/∂p = 0  at p = 0.5.      (6.15)

We have now proved that if φ(S, N) is an unbiased test of p = 0.5 against p ≠ 0.5, then it must be a conditional test and its conditional power must have derivative 0 at p = 0.5. Recall that S is conditionally binomial with parameters N and p. As stated in Sect. 8.3, Chap. 1, only two-tailed tests are admissible for this situation. The only such tests whose power has derivative 0 at 0.5 are equal-tailed. Hence the equal-tailed, level α, conditional test is the only admissible, unbiased test based on (S, N). Since (S, N) is a sufficient statistic when the problem is restricted to the family ℱ, this test, which is the equal-tailed, level α, conditional sign test, is therefore uniformly most powerful unbiased for the restricted problem. As mentioned initially, it follows that this test is uniformly most powerful unbiased for the original problem. □
Notice that S, N - S, and n - N are the cell frequencies in a sample of n where the three cells have respective probabilities p<, p>, and 1 - p< - p>.
Thus the restricted problem reduces by sufficiency to a trinomial problem.
The main part of the foregoing proof was essentially a proof that, in a
trinomial problem, when testing equality of two cell probabilities against a
two-sided alternative, a uniformly most powerful unbiased procedure is to
omit those observations falling outside the cells of interest and apply the
natural, equal-tailed binomial test to the reduced sample.
The argument involved showing that an unbiased test is conditional
(which is true even for one-sided alternatives), and that its conditional power
has derivative 0 at the null hypothesis. It was essential at (6.13) that the
distribution of N depend on r alone, and in going from (6.14) to (6.15) that
the conditional power depend on p alone. A most powerful, unbiased test
can be derived by this method for any exponential family if the null hypothesis
specifies the value of one of the "natural" parameters of the family [Lehmann,
1959, Sect. 4.4]. The trinomial distributions form an exponential family. In
place of the fact that only two-tailed tests are admissible, Lehmann uses a
generalization of the Neyman-Pearson fundamental lemma to two side
conditions. See also Sect. 8.3, Chap. 1.*

7 Paired Observations
Frequently in practice measurements or observations occur in pairs; the two
members of a pair might be treated and untreated, or male and female, or
math score and reading score, etc. While the pairs themselves may be in-
dependent, the members of a pair are related in some way. This relationship
within pairs may be present because of the nature of the problem, or may be
artificially imposed by design, as when experimental units are matched
according to some criterion. The units or pairs may be drawn randomly from
some population of interest, and the assignment within pairs may be random.
If not, additional assumptions may be needed, depending on the type of
inference desired.
For example, suppose that a random sample of individuals is drawn, and
a pair of observations is obtained for each individual, like one before and one
after some treatment. Then each individual acts as his own "control." Under
the assumption (not to be treated lightly) that there is no time-related change
other than the treatment, one can estimate the effect of the treatment, and
with smaller sampling variability than if the controls were chosen indepen-
dently of the treated individuals. If instead each individual receives two
treatments, such as a headache remedy administered on two completely
separate occasions, it may be possible to assign the treatments to the occasions
randomly for each individual. For comparing the two treatments in terms of
some measure of effectiveness, this provides a similar advantage in efficiency
without requiring such strong assumptions. One of the treatments could, of
course, be a placebo or other control treatment.
More generally, suppose that the units to be observed are formed into pairs
in some way, either naturally or according to some relevant criterion, and
observations are made on both members of each pair. A pair here might be
two siblings, two litter mates, a husband-and-wife couple, two different sides
of a leaf, two different but similar schools, etc., or one individual at two times,
as above. If the matching is such that the members of a pair would tend to
respond similarly if treated alike, random variation within pairs is reduced
and the nonrandom variation is easier to observe. If a difference is then ob-
served between two treatments, the difference can be attributed to the effect
of the treatments rather than to random differences between units with more
assurance than could an equal difference observed in a situation without
matching.
If, within each pair, one unit is selected at random to receive a certain
treatment, and the other unit receives a second treatment (or serves as a
control), we have a matched-pair experiment. If the pairs themselves are
independently drawn from some population of pairs, we have a simple random
sample of pairs. Another possibility is to draw a simple random sample of
individuals and then form pairs within this sample. If either type of randomi-
zation is lacking, as in the before-after example above, it is especially impor-
tant to consider the assumptions required for the type of inference being made.
In an analysis of paired observations, it is technically improper and
ordinarily disadvantageous to disregard the pairing. A convenient approach
to taking advantage of the pairing usually results if the measurements on the
two members of each pair are subtracted and the analysis is performed on the
resulting sample of differences. To a great extent, this procedure effectively
reduces a paired-sample problem to a one-sample problem, but the assump-
tions which are appropriate for the sample of differences depend on what
assumptions are appropriate for the pairs.

For example, suppose that each pair consists of a control measurement


V and a treatment measurement W. If the treatment has absolutely no effect,
it is often natural to assume that the measurements are permutable within
pairs, that is, (Vi, Wi) has the same distribution as (Wi, Vi) for each i. (This
holds, for instance, in a matched-pair experiment, where one unit of each pair
is chosen at random to be a control.) Under this assumption, and indepen-
dence between pairs, the null hypothesis of no treatment effect can be tested
by applying the methods of Sect. 3.1 (median equal to zero) to the differences
Xi = Wi - Vi. To discuss point estimation or confidence intervals, however,
we need to make some assumption about the treatment effect if there is one. A
strong assumption would be that the treatment has the same effect on every
unit, specifically, that any treated unit has a value larger by an amount μ than it would have had if untreated. Under this assumption and the earlier one, (Vi, Wi - μ) and (Wi - μ, Vi) have the same distribution, and therefore (Problem 48) the treatment effect μ is the median of the differences Xi = Wi - Vi, for each i. If also the pairs are independent, then the methods of Sect. 4 can be used to find a confidence interval for the common population median of the differences Xi, which is here a confidence interval for the
treatment effect. If the treatment effect varies from unit to unit, however, the
situation is more complicated, and confidence intervals obtained in this way
may be invalid. Even with simple random sampling, the median of the
differences Xi need not equal the difference of the medians of Wi and Vi.
Furthermore, in a matched-pair experiment, if the treatment has no effect on
the median of the distribution for any unit, but its dispersion increases in
direct relation to the median, then the differences Xi are typically skewed to
the right and their medians are typically negative but can also be positive.
See also Problems 49-52, Sect. 2, Chap. 3, and Sect. 9, Chap. 8.
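For readers who want to see the reduction to differences carried out, the sketch below is our illustration, not part of the text; it assumes scipy and that the assumptions just described hold, so that the median of the differences is the treatment effect. It computes the order-statistic confidence interval of Sect. 4 from paired data.

```python
# Hypothetical sketch: a distribution-free confidence interval for the median
# of the paired differences X_i = W_i - V_i, based on order statistics as in
# Sect. 4.  Assumes scipy; the function name and interface are ours.
from scipy.stats import binom

def median_difference_interval(v, w, alpha=0.05):
    x = sorted(wi - vi for vi, wi in zip(v, w))
    n = len(x)
    # largest k with P(S <= k - 1) <= alpha/2, where S is binomial(n, 1/2)
    candidates = [k for k in range(1, n // 2 + 1)
                  if binom.cdf(k - 1, n, 0.5) <= alpha / 2]
    if not candidates:
        raise ValueError("n is too small for the requested confidence level")
    k = max(candidates)
    exact_confidence = 1 - 2 * binom.cdf(k - 1, n, 0.5)
    return x[k - 1], x[n - k], exact_confidence   # interval (X_(k), X_(n+1-k))
```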
When the relevant assumptions are satisfied by the differences of pairs of
observations, all of the procedures and properties of the one-sample sign test
discussed in Sects. 3,4, and 5 are equally applicable to the set of differences of
the matched or paired observations. Hence, further discussion of such tech-
niques will not be given here. We note, however, that the methods explained
later in Chaps. 3 and 4 may be particularly appropriate for differences because
completely arbitrary distributions are less plausible for differences than for
raw measurements, and the assumption of symmetry in particular is more
plausible for differences (if the measurements are continuous or nearly so).

8 Comparing Proportions Using Paired Observations


We will now discuss a common and important situation which reduces to an
application of the type of inference described in (d) of Sect. 6.1 which con-
cerned the relation between the parameters p< and p> in the presence of ties.
The present situation is simply described as a set of paired observations where
each observation is dichotomous and hence can be recorded as either 0 or 1.


Then the difference of variables in any pair can only be -1, 0, or +1, so that
many ties are likely. The analysis follows that of the sign test with ties, but it
will be worthwhile to discuss this application and its interpretation, especially
in relation to other similar situations. In this section we will present the test
procedure, its interpretation and properties, and some related matters
including a model discussed by Cox.
Suppose that the observations occur in pairs, each pair consisting of a
unit, element, or measurement of Type I and one of Type II. For instance, as
in Sect. 7, each pair might contain a control unit (Type I) and a treated unit
(Type II), or a measurement before (Type I) and after (Type II), or the two
types might be husband and wife, two sides of a leaf, etc. Suppose further that
each observation merely measures the presence or absence of some charac-
teristic or response. It is natural to represent this by scores of 1 and 0 respec-
tively. Then we have essentially the same situation of matched or paired
observations as described in Sect. 7, except that here the variables are
"indicator" functions since each observation is a dichotomous measurement,
being 1 or 0. Some actual examples of this type of situation are:
(a) Some soldiers were asked whether they thought the war with the Japanese
would last over a year, given a lecture on the difficulty of fighting the
Japanese, and then asked the same question again [McNemar, 1947].
(b) Specimens were taken from the throats of persons suspected of having
diphtheria and grown on each of four media. Suppose two media were
to be compared with respect to the probability of growth taking place
[Cochran, 1950].
(c) Two drugs were tried on a number of patients, each drug on each patient,
to see which was more likely to cause nausea [Mosteller, 1952].
In the matched-pair situation described in Sect. 7, we were concerned
with inference about the population mean or median of the difference
between the two variables measured on each pair. The analogous comparison
here is between the population proportion pI of Type I scores which are 1 and the proportion pII of Type II scores which are 1, since these proportions are
the population means (expected values) of the two types of score, which we
will call score on I and score on II. Note that each observation is classified
as either Type I or Type II and as either score 1 or score O. For the examples
mentioned above, we might designate the types and scores as follows:
(a) Type I is before the lecture; Type II is after. Score 1 is yes; score 0 is no.
(b) Type I and II are the two media. Score 1 if growth takes place; score 0
if not.
(c) Type I and II are the drugs. Score 1 if nausea results; score 0 if not.
In what follows, we shall assume that a simple random sample of n is
drawn and two observations are made on each member, one of Type I and
one of Type II, and each observation is scored as either 1 or 0. Thus we have
n independent, identically distributed pairs of dichotomous observations.


We will discuss making inferences about the difference pII - pI of the popu-
lation proportions of the two types of scores which are 1. (The same analysis
can be applied to a comparative experiment with nonrandom pairs as long
as the elements within a pair are randomly assigned to be either Type I or
Type II, and we are concerned only with testing a null hypothesis such as no
treatment effect whatever exists. See also Sect. 8.7.)

8.1 Test Procedure

In the situation under discussion here, the data might be recorded using the
format of Table 8.1. For each pair, we subtract the score on I from the score
on II to obtain differences which are either +1, -1, or 0. While there may
be any number n of pairs observed, there are only four categories of response,
that is four distinguishable pairs of scores, and these are listed in the table.
The last column shows the four symbols which we shall use to designate the
number of pairs observed in each of the four categories.
Now suppose that the null hypothesis of primary interest is that the
probability of a score of 1 on I is the same as the probability of a score of 1
on II, that is pI = pII. The difference pII - pI is equal to the probability of a positive difference score II - I, minus the probability of a negative difference score II - I (Problem 53). That is, the null hypothesis pII - pI = 0 is equivalent to the hypothesis that the difference scores of +1 and -1 are equally likely to occur in the population. Accordingly, the test suggested in (d) of Sect. 6.1 applied to the numbers A, B, C, and D in Table 8.1 is appropriate in this situation. (The number B corresponds to S< in Sect. 6.1.) The A + D zero difference scores are ignored, and under the null hypothesis, given B + C = N, the distribution of B is binomial with parameters N and p = 1/2.
Hence the sign test with zero differences or ties, which was introduced in
Sect. 6, can be used to test this hypothesis. The properties and interpretation
of this test, which will be called here a test for equality of proportions based
on paired or matched observations (it is also frequently called the McNemar
test) will be discussed later in this section. An example is given in Sect. 8.3.

Table 8.1

Score on I    Score on II    Difference Score II - I    Response Category of Pair    Observed Number in Category
1             1              0                          1                            A
1             0              -1                         2                            B
0             1              +1                         3                            C
0             0              0                          4                            D

8.2 Alternative Presentations

The relevant data of the situation under discussion can be presented in


another way, as shown in Table 8.2. Each observed pair falls into one of the
cells of this table, and the total number in each cell is recorded using the
symbols A, B, C and D defined as in Table 8.1.
Table 8.2 is a "double dichotomy", one kind of 2 x 2 (contingency) table. It may appear that the usual tests, namely, Fisher's exact test (Sect. 3.2, Chap. 5) and the chi-square (approximate) test of independence or "no association," are applicable. However, the hypothesis of interest here, pI = pII, is not the usual one and cannot be tested by these procedures.
Independence of the row and column characteristics in a 2 x 2 table, here I
and II, is equivalent to equality of the population (unmatched) proportions
within the two columns, or within the two rows, which is not our present
concern. In our situation, association of the Types designated by I and II is
presumably present, whether by necessity or by design, and to a high degree.
We wish to take advantage of the association, not to test its existence.
For instance, consider situation (a) at the beginning of Sect. 8. A soldier
who is optimistic about the war before the lecture is more likely to be
optimistic after the lecture than one who is pessimistic beforehand. Our
concern is whether the net result of such a lecture is to make more soldiers
pessimistic than were before, that is, whether the lecture can be expected to
change more optimists to pessimists than vice versa. Those soldiers whose
point of view does not change as a result of the lecture are, in a sense, irrelevant
to the point under discussion (see also Sect. 6). For this reason, the test
suggested here is based on the values of Band C alone. (See also Sect. 8.4.)
In this situation then, there is association between Types I and II because
of the matching, but we are interested in the equality of the proportions PI
and p". These would be estimated from Table 8.2 by (A + B)/
(A + B + C + D) and (A + C)/(A + B + C + D) respectively, the respec-
tive proportions in the first row and the first column. The difference P" - PI
is then estimated by (C - B)/(A + B + C + D), which is to be compared to
O. This is equivalent to comparing the proportions in the lower left and upper
right cells of Table 8.2, and also to comparing the proportion in the second
row with the proportion in the second column.

Table 8.2

                          Score on II
                          1        0
Score on I       1        A        B
                 0        C        D

Table 8.3

                  I          II
Score    1        A + B      A + C
         0        C + D      B + D

Table 8.3 shows another alternative method of presenting the data. With
this 2 x 2 format, the quantity we are interested in is the difference between
the proportions in the two columns, since this is equal to (C - B)/
(A + B + C + D). It may therefore appear that Fisher's exact test and the
chi-square test of "no association" are now applicable, but again they cannot
be used, in this case because the assumptions are not satisfied. The quantities
in the two columns labeled I and II are not independent. In fact, the numbers
or proportions here refer to matched pairs, and each pair appears twice in
Table 8.3, once in each column. An adjustment can be made, but it leads
either to the test already suggested or to a large-sample approximation to
that test (Stuart [1957]).
Of course, the format for presentation of the data is largely a matter of
taste and is irrelevant to proper analysis, provided that the situation is
correctly understood. Table 8.1 is quite clear, but is less compact than might
be desired. Tables 8.2 and 8.3, although compact, might lead to misinter-
pretation. In addition, since Table 8.3 gives only the marginal totals of
Table 8.2, it alone does not contain sufficient information for application of
the appropriate test which requires knowledge of at least B and B + C = N.
Table 8.2 is the more common, but the format of Table 8.1 generalizes more
easily to more than two types of unit or measurement when each observation
is still recorded as 0 or 1. This generalization is equivalent to 0-1 observa-
tions occurring as k-tuples rather than as matched pairs, and the usual test
procedure is Cochran's Q Test (Problem 57).

8.3 Example

Twenty married couples were selected at random from a large population


and each person was asked privately whether he would prefer that a week's
summer vacation for the family be spent in the mountains or at the beach,
all other factors being equal. The subjects were told to ignore factors such as
cost and distance so that their preference would reflect only their assessment
of the pleasure derived by the family from the two kinds of vacation. The
preferences expressed by the 20 couples were recorded as pairs, with B denoting beach and M denoting mountains and the husband's view of family preference always the first member of the pair; the frequencies of the four possible pairs are shown in Table 8.4.

Table 8.4

Preference of (H, W)    Number in Category
(M, M)                  5
(M, B)                  8
(B, M)                  3
(B, B)                  4

The purpose of the study was to determine whether views of family preference for vacation are largely influenced by sex, and hence a possible source of
serious disharmony between husband and wife. Specifically, we wish to
determine whether a married man's view of family preference differs system-
atically or only randomly from his wife's view.
We first present the data in each of the formats that were described in
Sect. 8.2. The frequencies of occurrence for the four response categories of
pairs are easily counted. The results shown in Tables 8.4 and 8.5 are examples
of the general format of Tables 8.1 and 8.2 respectively. Table 8.6 is analogous
to Table 8.3, and it is clear here that the entries in the two columns are not
independent, because each couple appears twice in the table.
Suppose we wish to test the null hypothesis that the probability that the
husband responds mountains while the wife responds beach is equal to the
reverse type of disagreement, that is,
P(M, B) = P(B, M).      (8.1)
If we add P(M, M) to both sides of (8.1), the left-hand side is simply the
probability that the husband responds mountains since
P(M, B) + P(M, M) = P[(M, B) or (M, M)] = PH(M)
say, while the right-hand side of (8.1) similarly becomes the probability
Pw(M) that the wife responds mountains. Hence the null hypothesis can also
be stated as either
PH(M) = PW(M)  or  PH(B) = PW(B),
which may be easier to interpret than (8.1). If H is Type I and M is score 1, then PH(M) = PW(M) represents pI = pII here. The ordinary binomial test

Table 8.5

                          Wife's Preference
                          M        B
Husband's        M        5        8
Preference       B        3        4

Table 8.6

                     H         W
Preference    M      13        8
              B       7        12

with parameters N = 11, p = 0.5 is appropriate. From Table B, given 11


disagreements, the probability of obtaining 3 or less pairs of category (B, M)
is peS ::::;; 3) = 0.1133; this of course equals the probability of 8 or more
pairs of category (M, B). The two-tailed P-value is then 2(0.1133) = 0.2266.
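The arithmetic of this example is easy to reproduce; the following sketch (ours, assuming scipy is available) evaluates the conditional binomial probabilities for N = 11 disagreements, 3 of them of category (B, M).

```python
# Check of the example: with N = 11 disagreements and 3 pairs of category
# (B, M), the conditional binomial test with p = 1/2 gives the P-values quoted.
from scipy.stats import binom

N, observed = 11, 3
p_one_tailed = binom.cdf(observed, N, 0.5)   # P(S <= 3) = 0.1133
p_two_tailed = 2 * p_one_tailed              # two-tailed P-value = 0.2266
print(round(p_one_tailed, 4), round(p_two_tailed, 4))
```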

8.4 Interpretation of the Test Results

The interpretation of the results of a before versus after test of any kind bears
close scrutiny. Suppose, for example, that the same characteristic is measured
before (I) and after (II) a treatment, and by a one-sided, level α test for equality of proportions based on matched observations there are significantly more 1's after the treatment than before. Then, if the units constitute a random sample from some population, the inference, at level α, is that if all elements in
the population had been treated, there would have been more 1's after
treatment than before. However, the inference that the population would
have changed in this direction if treated does not automatically justify the
inference that the treatment would have changed the population. The
observations in themselves provide no information about what would have
happened in the absence of the treatment. In order to make an inference
about the treatment, it is necessary either to assume that the proportion of
1's in the population would not have changed in the absence of treatment or
to run an additional control experiment (Problem 60).
In example (a) at the beginning of Sect. 8, for instance, it is reasonable to
assume that the soldiers would not have changed their opinions in the absence
of a lecture. Hence the effect, if real, may reasonably be attributed to the
lecture and the circumstances surrounding it. Of course, it is conceivable
that a dull lecture on any topic would have made the soldiers pessimistic
about a speedy end to the war. In example (c) if the drugs were given in the
same order to every patient, an apparent difference between the drugs
might be due to a time effect. If the order was randomized for each patient
independently, this difficulty would be obviated. Actually, in this experiment
exactly half of the patients in the sample were chosen at random and given
drug I first while the rest were given drug II first. This alters the null distribu-
tion of the test statistic if there is a time effect, but may be more powerful
inasmuch as it balances out the time effect by design rather than by random-
ization. The test for equality of matched proportions will still be approximately "valid" as long as the time effect is not too large and is "conservative"
in any case (Problem 61). An exact test is given in Problem 106c of Chap.
5. See also Problem 9c and text in Chap. 5.
Even if the effect is attributable to the treatment, the inference is limited to the population sampled; the effect might be quite different on a population
with a different proportion of 1's initially. Sometimes, of course, one might be
willing to assume that the effect on different populations would be in the same
direction, especially if there is a continuous measurement underlying the


dichotomous one. For instance, the soldiers could have been asked how long
they thought the war would last. The reader may ponder to what extent the
effectiveness of the lecture on one population of soldiers guarantees its
effectiveness on another, such as a population more pessimistic initially, or
more experienced in fighting the Japanese, or made up of officers, or Marines,
etc.
With continuous measurements, one is often willing to assume or able to
verify that the treatment effect is approximately additive, that is, approxi-
mately the same on different populations. One possible corresponding
assumption for the present situation is discussed in Sect. 8.7. Such assump-
tions, however, do not justify transferring inferences to different populations
without assuming that the effect on different populations would be in the
same direction. In fact, they amount to assuming this and more.
If, initially, substantially more units score 1 than 0, there are many more
units available to change from 1 to 0 than from 0 to 1. One might argue that
this puts the treatment at a disadvantage (if 1 is better than 0). This does not
invalidate an inference about the effect of the treatment on the population,
but merely reflects the danger of tampering with the status quo if it is good.
It would be undesirable to give everyone polio vaccine unless the chance of
preventing polio in people who would otherwise have contracted it is much
greater than the chance of causing polio in people who would otherwise have
escaped it. This again emphasizes the possible danger in transferring the
inference to another population. A polio vaccine could be very advantageous
for children of the most susceptible age in a city undergoing an epidemic, but
very disadvantageous for use by everyone in the country. Some consider
yellow fever shots undesirable in the U.S. but desirable in some other
countries.
Another difficulty in interpretation, also by no means restricted to
situations involving matched proportions, is that it may not be clear what
population, if any, was actually sampled. For example, if the soldiers present
at the lecture constitute the population, and all of them were questioned both
before and after, one has a complete census of the population. To make an
inference about the effect of giving similar lectures to various groups at
various times, allowance must be made for the variation of the lectures, of
which one has a sample of only one, and for the fact that the group of soldiers
is not a sample of various soldiers at various times. Even if we were willing to
regard the soldiers as an independent, random sample of some sort before
the lecture, their responses afterwards are certainly not independent because
of the effect they have on one another as they listen to the same lecture. In
view of its sampling assumption, the test under discussion here may be only
partially relevant to the questions of interest. It may nevertheless be the most
nearly relevant procedure which is readily available. In situations where two
treatments (one may be a control) are compared by randomizing over
matched pairs, there is usually less problem in interpretation.

8.5 Properties of the Test

Some properties of the sign test with ties described earlier in Sect. 6 carry
over to the present situation. We assume throughout that the only data
available are the numbers A, B, C, and D in the four response categories for
a simple random sample of pairs, and that the null hypothesis is pI = pII, with no further restriction on the probabilities.

Then, specifically, the one-tailed test as applied in this section has uniformly greatest conditional power against the appropriate one-sided alternative among tests at its conditional level, where "conditional" here means "given B + C." The equal-tailed test has uniformly greatest conditional power against any alternative among tests at its conditional level which are conditionally unbiased against the alternative that the probability pI of scoring 1 on I differs from the probability pII of scoring 1 on II. The one-tailed, level α, conditional test is, from an unconditional point of view, uniformly most powerful against the appropriate one-sided alternative pI < pII or pI > pII, among level α tests which are unbiased against this alternative. The equal-tailed, level α, conditional test is, from an unconditional point of view, uniformly most powerful against the alternative pI ≠ pII, among level α tests which are unbiased against this alternative. Proofs are requested in Problem 63.

8.6 Other Inferences

In the situation under discussion, inferences other than a test for equality of
paired proportions may also be of interest. Some of these will be discussed
in this subsection. We continue to use the notation introduced in Sect. 8.1, that pI and pII are the proportions of pairs in the population scoring 1 on I and II respectively. We will also now be referring to the joint classification of observations on the basis of scores of both Types; hence it will be convenient to introduce the notation pij, for i = 0, 1, j = 0, 1, to denote the proportion of pairs in the population with Type I score i and Type II score j. For example, p01 is the true proportion scoring 0 on I and 1 on II. Thus p11, p10, p01 and p00 denote the true proportions of the population of pairs corresponding to the observed numbers A, B, C, and D respectively in Tables 8.1-8.3. Notice that pI = p10 + p11 and pII = p01 + p11.
The test already described in Sect. 8.1 was for H0: pII - pI = 0, or equivalently p01 - p10 = 0; under this null hypothesis, given B + C = N, the test statistic B follows the binomial distribution with parameters N and p = 1/2. The difference p01 - p10 is relevant for comparison of the proportions of kinds of "disagreements" between scores for the two Types, or kinds of "switches." This parameter corresponds to the difference parameter p< - p> (see Section 6.1) discussed in the context of the conditional sign test with ties in Sect. 6, so that the test and confidence interval procedures discussed there are relevant here also.

In the present context and notation, the parameter p defined in Eq. (6.1) can be expressed as p = p10/(p10 + p01). It represents the conditional probability of a score of 1 on I given a disagreement between scores for the two types. Since B is the number of observations scoring 1 on I and 0 on II, we know that, given B + C = N, B follows the binomial distribution with parameters N and p. (Recall that B corresponds to the number S< defined in Sect. 6.1.) Hence the usual binomial procedures are appropriate for tests of hypotheses and confidence intervals for this p and also for (1/p) - 1 = p01/p10 (Problem 64a). In the present situation, the primary advantage of the parameter (1/p) - 1 seems to be its adaptability to simple inference techniques, but it will be given a useful interpretation in the next subsection.
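One standard way to carry out the binomial interval computation is the Clopper-Pearson (exact) form expressed through beta quantiles; the sketch below is ours (assuming scipy) and is not a procedure spelled out in the text. Limits for (1/p) - 1 = p01/p10 follow by the monotone transformation.

```python
# Illustrative sketch: exact (Clopper-Pearson) confidence limits for
# p = p10/(p10 + p01), based on B pairs of type (1 on I, 0 on II) and
# C pairs of type (0 on I, 1 on II); limits for theta = (1/p) - 1 = p01/p10.
from scipy.stats import beta

def discordant_pair_ci(b, c, alpha=0.05):
    n = b + c
    lower = beta.ppf(alpha / 2, b, n - b + 1) if b > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, b + 1, n - b) if b < n else 1.0
    # theta = (1/p) - 1 is a decreasing function of p
    theta_lower = 1 / upper - 1
    theta_upper = (1 / lower - 1) if lower > 0 else float("inf")
    return (lower, upper), (theta_lower, theta_upper)
```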
Another quantity which might be of interest is the true proportion of observations scoring 1 on II among those scoring 1 on I, or p11/pI = p11/(p10 + p11). This quantity is the conditional probability of a score of 1 on II
given a score of 1 on I. Alternatively, the conditional probability of a score
of 1 on II given a score of 0 on I might be of interest. Inferences about these
quantities can appropriately be based on the binomial distribution, and the
procedures are easily developed (Problem 64b).
We have discussed inferences concerning the "disagreements" between
scores on the two categories; now what about the "agreements?" For
example, one might be interested in testing the null hypothesis that the
probability of Type I and II scoring the same does not depend on the score
of Type I. This condition reduces successively to
P(same | 1 on I) = P(same)
p11/(p10 + p11) = p11 + p00
p11(1 - p10 - p11 - p00) = p00 p10
p11 p01 = p00 p10,      (8.2)
and the result is identical if we start with either of the relations P(same | 0 on I) = P(same) or P(same | 0 on I) = P(same | 1 on I). If the data are represented in a new 2 x 2 table using the format of Table 8.7, it is clear that the usual contingency table test of independence (of score on I and sameness) is appropriate for the null hypothesis in (8.2). This is of course equivalent to a test of equality of proportions within the rows of Table 8.7, or
p11/(p10 + p11) = p00/(p00 + p01)      (8.3)
which in the present context says P(same | 1 on I) = P(same | 0 on I). Another equivalent way of stating the null hypothesis in (8.2) is as an equality of odds, or
p11/p10 = p00/p01      (8.4)
which says here that the odds for "same given 1 on I" are equal to the odds for "same given 0 on I," or
P(same | 1 on I)/P(different | 1 on I) = P(same | 0 on I)/P(different | 0 on I).

Table 8.7

                     same    different
Score on I     1     A       B
               0     D       C

Table 8.8

                     same    different
Score on II    1     A       C
               0     D       B

A test of the null hypothesis that the probability of Types I and II scoring
the same does not depend on the score of Type II can also be performed using
a test of independence (of score on II and sameness), or a test of equality of
proportions within the rows of the new table shown as Table 8.8. This
hypothesis is not the same as (8.2)-(8.4), but is equivalent (Problem 65) to
p11 p10 = p01 p00.      (8.5)
These hypotheses of independence of sameness and score on I or II are
not as easy to interpret as they may seem. If there are many more scores of 1
than 0 on II, then it is easier to be the same given 1 on I than given 0 on I.
(Compare Sect. 8.4.) It may further exemplify the difficulty of interpretation
of independence in Table 8.7, and Table 8.8, to remark that independence in
both implies that p11 = p00 and p10 = p01, except in the degenerate case
where either P(same) = 0 or P(different) = 0 (Problem 66).
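As an illustration of the test of (8.2), the sketch below (ours, assuming scipy; for large samples the chi-square test of independence could be used instead) forms Table 8.7 from the counts A, B, C, D and applies Fisher's exact test.

```python
# Hypothetical sketch: testing hypothesis (8.2), that "same vs. different" is
# independent of the score on I, by applying Fisher's exact test to Table 8.7.
from scipy.stats import fisher_exact

def test_same_vs_score_on_I(a, b, c, d):
    table_8_7 = [[a, b],   # score 1 on I: A pairs "same", B pairs "different"
                 [d, c]]   # score 0 on I: D pairs "same", C pairs "different"
    odds_ratio, p_value = fisher_exact(table_8_7)
    return odds_ratio, p_value
```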

8.7 Cox Model

In Sects. 8.1-8.6, the essential assumption was that the observed pairs con-
stitute a simple random sample from some population of pairs. In Sect. 7,
we indicated that in the case of matched pairs of continuous observations
an alternative assumption is often made. This is that the observations of a
given pair are random (as in a matched-pair experiment), while the pairs
themselves have "fixed effects" and need not be random at all. Each observa-
tion then reflects the fixed effect of the pair to which it belongs, as well as the
effect of the treatment (or Type). An analogous assumption for paired
dichotomous observations has been discussed by D. R. Cox (1958c). We
explain it here in the context of the drug example mentioned in (c) at the
beginning of Sect. 8, where two drugs are tried on a group of patients, each
drug once on each patient. Consider the group of patients as fixed, and sup-
pose that, for patient i, drug I causes nausea with probability pIi and drug II causes nausea with probability pIIi. Note that the randomness is now as-
sociated with different possible outcomes on a given patient, rather than with
the choice of a patient from the population. Suppose also that the outcomes
of separate trials on the same patient are independent (as trials on different
patients would be). If each drug is tried once on each patient, then the
probability p11i that patient i scores 1 (nausea) on each drug is, by the independence assumption,
p11i = pIi pIIi.      (8.6)
Similarly, with obvious definitions, we have
p10i = pIi (1 - pIIi)      (8.7)
p01i = (1 - pIi) pIIi      (8.8)
p00i = (1 - pIi)(1 - pIIi).      (8.9)
Suppose now that we make the additional assumption that the drug effect is constant, in the sense that, for all patients, the odds for nausea under drug II are the same multiple θ of the odds for nausea under drug I. In symbols, this assumption is
pIIi/(1 - pIIi) = θ pIi/(1 - pIi)  for all i.      (8.10)
Notice that when there is no difference between the drugs we have θ = 1. With the assumptions (8.6)-(8.9), θ in (8.10) is given by
θ = (1 - pIi) pIIi / [pIi (1 - pIIi)] = p01i/p10i  for all i.      (8.11)
The fact that θ is the same for all i means that every patient has the same conditional probability of a score of 1 on drug I given that he scored 1 on exactly one of the drugs. Specifically, this conditional probability is
p10i/(p10i + p01i) = 1/(θ + 1)  for all i.      (8.12)
Under these assumptions, then, it is again true that, given B + C = N, B is binomial with parameters N and p = 1/(θ + 1). Hence the usual binomial procedures can be applied to obtain inferences about p, and hence also about θ = (1/p) - 1. Here, however, the underlying assumptions are much stronger, and as a result θ has an interpretation not previously available. In particular, Eq. (8.10) implies that for every patient the same drug has the higher probability of causing nausea, and indeed by the same amount in the sense of multiplying the odds by the same factor.
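A small numerical check (ours; no claim that it appears in Cox's papers or in the text) shows how (8.10)-(8.12) fit together: whatever the patient-specific probabilities pIi, a common odds multiplier θ forces the conditional probability in (8.12) to equal 1/(θ + 1).

```python
# Numerical check of (8.10)-(8.12): for any p_Ii and a common odds multiplier
# theta, the conditional probability of "1 on drug I given exactly one 1"
# equals 1/(theta + 1) for every patient i.
theta = 2.5
for p_I in (0.1, 0.4, 0.8):
    odds_II = theta * p_I / (1 - p_I)          # drug II odds, from (8.10)
    p_II = odds_II / (1 + odds_II)
    p10 = p_I * (1 - p_II)                     # (8.7)
    p01 = (1 - p_I) * p_II                     # (8.8)
    print(round(p10 / (p10 + p01), 6), round(1 / (theta + 1), 6))  # (8.12)
```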
This model for dichotomous observations corresponds to an additive model for continuous observations with a pair effect αi and a treatment difference μ, as can be seen by taking logarithms in (8.10). (Here log[pIi/(1 - pIi)] plays the role of αi while log θ plays the role of μ.) This is a special case of an adaptation of the general linear or regression model to dichotomous dependent variables (Cox [1958b, 1970]) where the logarithm of the odds is assumed to be a linear function, with unknown coefficients, of some "independent" variables whose values are known (and which may be design or dummy variables).

*The properties of the binomial tests discussed in Sect. 8.5 carry over to the present situation in the following form (Problem 67). Regarding the pIi as nuisance parameters (the pIIi are then functions of θ and the pIi), one-tailed binomial tests for p are uniformly most powerful unbiased tests for θ against one-sided alternatives, and the unbiased two-tailed binomial tests for p are uniformly most powerful unbiased tests for θ against two-sided alternatives.*

9 Tolerance Regions
One-sample procedures based on the binomial distribution are also useful
for obtaining tolerance regions. The methodology can be viewed as a general-
ization of the procedure for constructing confidence limits for the median
or any specified p-point (quantile) of a distribution. Because of difficulties
analogous to that of defining a unique p-point for discrete distributions, it is
convenient to assume throughout this section that the relevant distribution
is continuous.
Recall that the median of a population is the 50% point of the distribution, or the point such that 50% of the population lies below it. Let X* be an upper 95% confidence bound for the population median. This means that X* is obtained in such a way that it has probability 0.95 of exceeding the population median. It follows that X* has probability 0.95 of exceeding at least 50% of the population. Equivalently, the region to the left of X* has probability 0.95 of covering (including) at least 50% of the population. This is perhaps the simplest example of a "tolerance region," more specifically, a "50% tolerance region" at the "confidence" level 0.95.
In this section we will define tolerance regions exactly, mention some
practical situations where they might be useful, and explain a simple method
of constructing them from a random sample. Then we will discuss their
usefulness for description and prediction, pointing out some difficulties in
the interpretation of tolerance regions. Finally we will generalize the con-
struction procedures. The question of what would be a "good" or "best"
tolerance procedure will not be discussed.

9.1 Definition

For any fixed region R of a given population, we define the coverage of R


as the proportion of the population which lies in R, that is, the proportion
of the population covered by R. In random variable terminology, the
coverage of R is
C(R) = P(X ∈ R)      (9.1)
where X is drawn at random from the population.

Suppose that, for some purpose, we would ideally like to find a region with
coverage 0.5, that is, a region including 50 % of the population. Lacking
special knowledge about the population distribution, we cannot accomplish
this exactly. We might be willing, instead, to define a region (depending on a
sample) so that there is probability 0.95 that it will have coverage at least 0.5.
This would perhaps sound difficult to do, had an example not been given
above.
In general, a tolerance region is a random region having a specified probability, say 1 - α, that its coverage is at least a specified value, say c. Various names are given to 1 - α and c in the literature. We shall call 1 - α the confidence level and c the tolerance proportion, the latter because in some situations it is the minimum proportion of the population which it is considered tolerable to cover. We shall also speak of a "c tolerance region with confidence 1 - α." Regions which have this property under essentially no
restrictions on the population are sometimes called "nonparametric toler-
ance regions," to distinguish them from "parametric tolerance regions,"
which have the required property as long as the population belongs to some
specified parametric family, but not in general otherwise. Only nonpara-
metric tolerance regions will be discussed here.

9.2 Practical Uses

In nonspecific settings, tolerance regions are often suggested for the purpose
of describing the underlying population or for predicting future observations.
For these purposes, however, a tolerance region at a conventional confidence
level like 0.95 is of doubtful value. For description, for instance, the difficulty
of interpreting a tolerance region is analogous to the difficulty of interpreting
a confidence bound by itself as an estimator. Such difficulties, and possible
remedies, will be explained further later, in Sects. 9.4 and 9.5.
The specific context in which tolerance regions are most often employed
is that of production processes, since then it is natural to be concerned with
whether the items produced are meeting specifications or measuring within
some design tolerances, such as 100 ohms ± 10 ohms. If certain deviations
of various characteristics from designated values are specified in advance as
tolerable, it is easy to make non parametric inferences about the proportion
of the population in this region of tolerable values (Problem 68). This is not
the type of tolerance region defined above, however; we shall be concerned
here not with prespecified tolerance limits or regions of tolerable values, but
rather with finding a tolerance region, based on a sample, such that a pre-
specified proportion of the population will be covered by that region with a
preselected level of confidence. Because these regions are based on a sample,
they are often called "statistical" tolerance regions. We will not repeat the
adjective in the discussion to follow, since all tolerance regions here will be
statistical.

These tolerance regions are also frequently used in connection with a


production process. Suppose that no particular specifications are of special
interest and one merely wishes to establish a rule for keeping tabs on the
process. The process might be watched carefully for a limited period in order
to collect sample data to use in finding a tolerance region having, for example,
tolerance proportion 0.99 and confidence level 0.95. Thereafter, a cause of
trouble is sought only when a sampled observation falls outside the tolerance
region thus established. Whatever limits are set on the basis of the first sample,
there will be some long run proportion of trouble-shooting even when the
process is in control. The probability is 0.95 that the first sample will set
tolerance limits such that trouble-shooting will be required at most 1 % of
the time in the long run, if the process stays "in control," that is, does not
change. Of course, more complicated conditions for trouble-shooting might
well be used in practice (Problem 69).
As another possible use of tolerance limits in production processes,
suppose that a producer wants to make some kind of money-back guarantee
that his output will lie in a certain range, and he does not care exactly what
the range is as long as it is not too much wider than necessary. Of course, he
wants to be reasonably sure that no more than a very small proportion of his
production will fall outside the guaranteed range. If the guaranteed range is
a 0.995 tolerance interval with confidence 0.99 say, then the probability is
0.99 that, in the long run, at least 99.5 % of the production will satisfy the
guarantee and the probability is only 0.01 that as much as 0.5 % will fail to
satisfy it. This assumes no change in the process.
As an example from another field, consider setting norms for a physiological measurement, say the level of cholesterol in the blood.¹ Suppose that
ideally one would like to be able to say that 98 % of normal people have
between a and b milligrams of cholesterol per milliliter of blood as measured
in a particular way. However, limits must be set on the basis of measurements
on a finite sample of normal people. The endpoints of a 0.98 tolerance interval
with confidence 0.95 might be chosen as limits of the norm, that is, the
"normal range." (For example, the normal range of total serum cholesterol
in adults is 150-250 mg/100 ml of blood.) Then the probability is 0.95 that
the limits will include at least 98 % of normal people, and the probability is
only 0.05 that more than 2 % of normal people will fall outside.
In each of these situations, once the concept of tolerance regions and the
difficulties of interpretation discussed in the next two subsections are clearly
understood, serious objections to tolerance regions as a solution to the real
problem at hand will come readily to mind. To improve upon them sub-
stantially, however, is not so simple and requires consideration of aspects of
each individual problem which were not even touched on above. Further,
the relevant information on these aspects may be difficult or impossible to

¹ The authors are indebted to Frederick Mosteller for a discussion of uses of tolerance regions, and in particular for suggesting this example.



find. For example, clearly relevant but hard to estimate are the costs of the
various possible acts in each problem, both when the production process is
unchanged or the people are normal as regards cholesterol level, and when
any of a variety of possible alternatives holds.

9.3 Construction of Tolerance Regions: Wilks' Method

Assume that X1, ..., Xn are independent observations on the same distribution, with c.d.f. F. Let X(1), ..., X(n) denote the order statistics of these observations. Assume that F is continuous, so that, with probability one, there are no ties and a unique ordering X(1) < X(2) < ... < X(n) exists. Let Ck be the coverage of the interval between X(k-1) and X(k). Then by (9.1) we have for k = 2, ..., n,
Ck = F(X(k)) - F(X(k-1)).      (9.2)
(Since F was assumed continuous, the coverage is the same whether the endpoints are included in the interval or not. If a specific statement were required, we would assume that right (upper) endpoints are included, and left (lower) endpoints are not.) We further define
C1 = F(X(1)) and Cn+1 = 1 - F(X(n))
as the coverage of the interval below X(1) and the interval above X(n) respectively. The definition in (9.2) applies also to these two intervals once we define X(0) = -∞ and X(n+1) = +∞.
We now have n + 1 coverages, C1, C2, ..., Cn+1, corresponding to the n + 1 intervals into which the n sample points divide the real line. These n + 1 coverages are random variables, and their joint distribution has a number of interesting properties (Problems 70 and 71). A property which provides an easy method of construction of tolerance regions is the following. Let i1, ..., is be any s different integers between 1 and n + 1 inclusive; then the sum C of the corresponding coverages,
C = C_{i1} + C_{i2} + ... + C_{is},
has the same distribution as the sth smallest observation in a sample of n from the uniform distribution on the interval (0, 1) (Problem 71g), namely
P(C ≥ c) = n (n - 1 choose s - 1) ∫_c^1 u^(s-1) (1 - u)^(n-s) du      (9.3)
         = Σ_{k=0}^{s-1} (n choose k) c^k (1 - c)^(n-k).      (9.4)
Notice that the distribution depends on s and n only, not on which s integers are chosen, nor on the distribution from which the sample was drawn, as long as F is continuous. Notice also that C has a beta distribution by (9.3) and that (9.4) is a left-tail binomial probability.
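The equivalence of (9.3) and (9.4) is easy to verify numerically; the check below is ours (assuming scipy) and uses the fact that C has the Beta(s, n - s + 1) distribution.

```python
# Numerical check that (9.3) and (9.4) agree: the sum C of s elementary
# coverages is Beta(s, n - s + 1), so P(C >= c) is a left-tail binomial
# probability with parameters n and c.
from scipy.stats import beta, binom

n, s, c = 20, 17, 0.65
print(beta.sf(c, s, n - s + 1))   # P(C >= c) computed from (9.3)
print(binom.cdf(s - 1, n, c))     # the same probability from (9.4)
```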

Thus we have the following simple method of constructing tolerance regions. Select s distinct integers between 1 and n + 1 inclusive, and for each integer i selected, include the interval between X(i-1) and X(i) in the tolerance region. This gives a tolerance region with tolerance proportion c and confidence level 1 - α if s is the critical value for an upper-tailed, level α binomial test of the null hypothesis p = c based on a sample of size n. Equivalently, 1 - α = P(C ≥ c) is the probability of s - 1 or less successes, and α = P(C < c) is the probability of s or more successes, under the binomial distribution with parameters n and p = c. Thus, given the sample size n and the tolerance proportion c, the confidence level corresponding to each value of s can be obtained from a binomial table with this n and p = c.
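In place of a binomial table, the relation between s, c, and the confidence level can be evaluated directly; the sketch below is ours (assuming scipy) and simply restates the rule just given.

```python
# Illustrative sketch of the construction: the confidence level attached to a
# tolerance region made up of s elementary intervals is P(S <= s - 1) with
# S binomial(n, c); the smallest adequate s for a target confidence follows.
from scipy.stats import binom

def tolerance_confidence(n, c, s):
    return binom.cdf(s - 1, n, c)      # confidence level 1 - alpha

def smallest_s(n, c, target_confidence):
    s = 1
    while binom.cdf(s - 1, n, c) < target_confidence:
        s += 1                         # s = n + 1 always suffices
    return s

# For example, smallest_s(20, 0.65, 0.95) returns 17.
```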
If the s integers chosen are 1, 2, ..., s, then the tolerance region obtained by this method is simply the interval with endpoints -∞ and X(s), where X(s) is the upper confidence bound for the c-point ξc of the distribution, as explained in Sect. 4. In this sense, tolerance regions are a generalization of confidence bounds for quantile points, and the probabilities obtained previously continue to apply in this case, as indicated in connection with the median at the beginning of the section.
A more usual procedure is to choose s integers, k + 1, k + 2, ..., k + s where 1 ≤ k ≤ k + s ≤ n. Then the tolerance region is the interval with endpoints X(k) and X(k+s), both finite. Note that, in accordance with what
has already been said, the confidence level associated with this interval, when
regarded as a tolerance interval, depends only on s and not on k. (This is not
true when the interval is regarded as a confidence interval for a quantile
point. See Problem 72.)
Sometimes it is convenient to consider not the number s of included intervals but instead the number m of excluded intervals. The procedure then would be to select m distinct integers between 1 and n + 1 inclusive, delete the interval between X(i-1) and X(i) for every selected integer i, and let the tolerance region consist of the remainder of the real line. Then (Problem 71h) the confidence level 1 - α is the probability of m or more successes and α is the probability of m - 1 or less successes, under the binomial distribution with parameters n and p = 1 - c (not p = c). Again, given n and c, the confidence level 1 - α corresponding to each value of m can be obtained from a binomial table with this n and p = 1 - c. If the excluded intervals lie at the extremes of the sample, then the resulting tolerance region is again an interval.
To decide on the sample size n, one might select m, c, and α and choose n just large enough so that the tolerance region omitting m intervals has tolerance proportion c and confidence level at least 1 - α. Determining the required n is fairly easy using binomial tables. One might, for example, plan to omit just two intervals, the ones to the left and right of all sample values, and use as a tolerance region the interval with endpoints X(1) and X(n), that is, the entire range of the sample; this fixes m = 2. If the k leftmost and l rightmost intervals are omitted, so that the tolerance region is the interval whose endpoints are the kth smallest and lth largest observations X(k) and
[Figure 9.1: Population coverage of interval X(k) to X(n+1−l) with confidence
0.95, m = k + l; the coverage c (vertical axis, 0.01 to 0.995) is plotted against
the sample size n (horizontal axis, 1 to 500) for various m. (Source: Adapted
from Figure 2 on p. 584 of Murphy, R. B. [1948], "Nonparametric tolerance
limits," Annals of Mathematical Statistics, Vol. 19, 581-589, with permission
of the Institute of Mathematical Statistics.)]

X(n+1−l), then m = k + l. If values of c and α are also selected, then n is the
smallest value for which the probability of m or more successes is at least
1 − α, under the binomial distribution with parameters n and p = 1 − c.
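This sample-size search is easy to automate. The sketch below (again
illustrative Python, not from the text) finds the smallest n for which the
probability of m or more successes under the binomial distribution with
parameters n and 1 − c is at least 1 − α.

    from math import comb

    def prob_m_or_more(n, p, m):
        """Upper-tail binomial probability P(m or more successes)."""
        return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m, n + 1))

    def smallest_n(m, c, alpha):
        """Smallest n so that omitting m intervals leaves tolerance
        proportion c with confidence at least 1 - alpha."""
        n = m
        while prob_m_or_more(n, 1 - c, m) < 1 - alpha:
            n += 1
        return n

    # Using the range X(1) to X(n) (m = 2), tolerance proportion 0.90 and
    # confidence 0.95 require n = 46 observations.
    print(smallest_n(2, 0.90, 0.05))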
Murphy [1948] gives graphs of c versus n for various values of m and
1 − α = 0.90, 0.95, and 0.99. These are somewhat easier to use than binomial
tables for some purposes. They are essentially equivalent to graphs of
binomial confidence limits or percent points of the beta distribution (as, for
instance, Fig. 6.2, Chap. 1), arranged in a certain way (Problem 73). Figure 9.1
reproduces the graph for 1 − α = 0.95; m denotes the number of intervals
omitted, as explained in the previous paragraph.

9.4 Tolerance Regions for Description

As a descriptive device, a tolerance region with a high confidence level is
deceptive, especially if it is based on a small amount of data. The reason can
be explained as follows. Consider, for instance, a 0.65 tolerance region with
confidence 0.95. If, as in the previous subsection, there is no probability that
the coverage is exactly 0.65, then such a region has probability 0.95 of
covering more than 65% of the population and probability only 0.05 of
covering less than 65%. Then "typically" the coverage will be more than
65%, and substantially more if the region is based on little data. This is
clearly reflected in Fig. 9.2, which shows the distribution of the actual
coverage C of a 0.65 tolerance region with confidence 0.95 based on 20
independent observations and obtained (Problem 74) by the method de-
scribed above. It is as though the tolerance proportion 0.65 were a lower 0.95
confidence bound for the actual coverage, except that what is random is the
region rather than its tolerance proportion.
Another aspect of the same phenomenon is that a tolerance region having
a particular tolerance proportion at a specified confidence level has at the
same time other tolerance proportions at other confidence levels. For
example, the region just mentioned, based on 20 observations, is not only
a 0.65 tolerance region with confidence 0.95, but also a 0.50 tolerance region
with confidence 0.999, a 0.75 tolerance region with confidence 0.775, etc.
These and other combinations of tolerance proportion and confidence level
can also be read from Fig. 9.2 (Problem 74).
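These combinations can be checked directly from the binomial relation of
Sect. 9.3. In the sketch below (illustrative Python; the choice s = 17 included
blocks out of 21 is an inference consistent with the figures quoted above, not
stated explicitly in the text), the same region based on n = 20 observations is
evaluated at several tolerance proportions.

    from math import comb

    def confidence(n, s, c):
        # P(coverage >= c) = P(at most s - 1 successes) under Binomial(n, c)
        return sum(comb(n, k) * c**k * (1 - c)**(n - k) for k in range(s))

    for c in (0.50, 0.65, 0.75):
        print(c, round(confidence(20, 17, c), 3))
    # prints approximately 0.999, 0.956, and 0.775, matching the
    # combinations of tolerance proportion and confidence level quoted above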
The relation between the tolerance proportion and confidence level
depends on the sample size (for any reasonable method of obtaining tolerance
regions). Thus it is impossible to know, let alone to apprehend, just how
conservative a tolerance region is merely from knowledge of the tolerance
proportion and confidence level.
The difficulties just mentioned can be seen clearly in the special case
mentioned earlier when the tolerance region is the interval from −∞ to
X(s), and X(s) is an upper confidence limit at level 1 − α for the population
c-point ξc. Then the use of the tolerance region for description amounts to

[Figure 9.2 (plotted against p = coverage, from 0 to 1): Distribution of actual
coverage of a 0.65 tolerance region with confidence 0.95 when n = 20.]

the use of the upper confidence limit X(s) as a sample descriptor of ξc. How-
ever, a single confidence limit at a typical level would not ordinarily be
considered a descriptor in this sense, and would be a very lopsided descriptive
device. For example, an upper 95 % confidence limit for a parameter, say the
median, is not by itself very descriptive of what one knows about the param-
eter, since the limit could be any distance from the parameter.
Some methods of avoiding this deceptiveness when using tolerance regions
as a descriptive device are listed below.
(1) Use the confidence level 0.50. Then the tolerance region is analogous to
a median unbiased estimator. 2
(2) State several combinations of the tolerance proportion and confidence
level for the region given, for instance, the tolerance proportions corre-
sponding to the confidence levels 0.05, 0.25, 0.50, 0.75, and 0.95.

2 An estimator T is called median unbiased for a parameter θ if T has median θ for any allowed
distribution. A significant fact about median unbiasedness is that h(T) is median unbiased for
h(θ) if T is median unbiased for θ and h is monotone (see also van der Vaart [1961]). This prop-
erty does not hold for ordinary (mean) unbiasedness.

(3) Instead of giving any tolerance proportion and confidence level, give the
expected coverage, as defined in the next subsection. This is somewhat
analogous to unbiased estimation.
(4) Give two regions, one a tolerance region as defined already, and the
other what might be called an inner tolerance region with the same
tolerance proportion c and confidence level 1 − α. This is analogous to
giving both upper and lower confidence bounds. By an inner tolerance
region is meant a region having probability 1 − α that its coverage is at
most c. It can be chosen to lie inside the ordinary tolerance region (as
long as α < 0.5). Its complement is a tolerance region with tolerance
proportion 1 − c and the same confidence level 1 − α.

9.5 Tolerance Regions for Prediction

The prediction problem we shall discuss is as follows. A sequence of observa-
tions is to be made, and after the first n observations, we want to predict
whether the (n + 1)th observation will lie in some region. The first n obser-
vations may be used in constructing this region, and it should provide
adequate probability of correct prediction. More precisely, given n + 1
independent, identically distributed random variables X1, ..., Xn, X(n+1),
construct a region depending only on X1, ..., Xn. What is the probability
that X(n+1) will lie in the region?
The question is still not precise, and there are two natural ways to define
this probability. One is conditional on X1, ..., Xn, and then the probability
is just the coverage of the region. That is,

P[X(n+1) ∈ R(X1, ..., Xn) | X1, ..., Xn] = C[R(X1, ..., Xn)]        (9.5)

where R is the region. The other is unconditional, and then the probability
is the expected coverage of the region, or

P[X(n+1) ∈ R(X1, ..., Xn)] = E{C[R(X1, ..., Xn)]}.                  (9.6)
The definition in (9.5) is the tolerance proportion or coverage we have been
discussing in the previous subsections. This probability is relevant to an
infinite number of predictions, being the proportion of correct predictions
if each of the future observations X(n+1), X(n+2), ... is predicted separately to
lie in R(X1, ..., Xn). While this proportion is unknown, depending on the
unknown population distribution, we can make confidence statements about
it if R is a tolerance region.
The probability in (9.6) is the probability, before any observations are
taken, that the particular prediction X(n+1) ∈ R(X1, ..., Xn) will prove
correct. For tolerance regions of the type we have discussed, this probability
does not depend on the unknown population distribution, and is simply
s/(n + 1) (Problem 71), where s is the number of intervals X(i−1) to X(i) in the

region R. For example, before any observations are taken, the probability
that the (n + 1)th observation will lie within the entire range X(1) to X(n) of
the first n is (n − 1)/(n + 1), and the probability that it will lie between any two
successive observations X(i−1) and X(i) is 1/(n + 1).
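The unconditional probability s/(n + 1) can be checked by simulation. The
following small Monte Carlo sketch (illustrative Python, not from the text;
any continuous distribution would do, the normal is used here) estimates the
probability that the (n + 1)th observation falls inside the range of the first n
and compares it with (n − 1)/(n + 1).

    import random

    def estimate_range_prediction(n, trials=100_000):
        """Estimate P(the (n+1)th observation lies between the smallest and
        largest of the first n) for a continuous distribution."""
        hits = 0
        for _ in range(trials):
            xs = [random.gauss(0.0, 1.0) for _ in range(n + 1)]
            first_n, x_new = xs[:n], xs[n]
            if min(first_n) < x_new < max(first_n):
                hits += 1
        return hits / trials

    n = 9
    print(estimate_range_prediction(n), (n - 1) / (n + 1))  # both near 0.8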
If an ordinary tolerance region at a high confidence level is used for the
region R, then the probability in (9.6) will be larger, and may be much larger,
than the tolerance proportion c, since the actual coverage C(X1, ..., Xn)
exceeds the tolerance proportion with probability equal to the confidence
level chosen. This illustrates in another way the difficulty of making a simple
interpretation of a tolerance region as a descriptor.

9.6 More General Construction Procedures

The procedure described in Sect. 9.3 for constructing tolerance regions can
be generalized in a number of ways. As an aid to understanding, we shall
proceed informally and one step at a time.
Suppose first that we have a sample of bivariate observations rather than
univariate observations, and we seek a tolerance region in two-dimensional
rather than one-dimensional space. Of course, we could look only at the first
coordinate of each observation and construct a univariate tolerance region
based on these n univariate observations. The corresponding (equivalent)
region in the plane would then be a bivariate tolerance region. For instance,
if the univariate region is an interval I, then the corresponding bivariate
region would be simply the vertical band whose intersection with the hori-
zontal axis is I.
A more interesting possibility would be to look at some real-valued
function φ other than the first coordinate of the bivariate observations, which
we denote by X. Suppose that Z = φ(X) has a continuous c.d.f. G. Define
a tolerance region S in Z-space based on the order statistics Z(1), ..., Z(n)
of the sample of values Z_i = φ(X_i). Let R be the corresponding region in
X-space, or formally, R = {x: φ(x) ∈ S}. Then R has the same coverage for
X that S has for Z and hence is a tolerance region in X-space with the same
tolerance proportion and confidence level. For example, if φ(X) is the distance
of the point X from the origin, or the length of the vector X, then the Z_i are
the distances of the X_i from the origin. If the Z tolerance region S is the interval
from Z(k) to Z(k+s), then the corresponding X tolerance region R is the ring
consisting of those points x whose distance from the origin is between Z(k)
and Z(k+s). If some other well-behaved function φ had been used in place of
distance, the boundaries of R would still be the two contours where φ has
the values Z(k) and Z(k+s).
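A sketch of this construction in Python (illustrative only; the data are
hypothetical) reduces each bivariate observation to its distance from the
origin and returns the ring between two order statistics of those distances.

    import math
    import random

    def ring_tolerance_region(points, k, s):
        """Return (inner, outer) radii of the ring {x : inner < |x| <= outer}
        built from Z(k) and Z(k+s); Z(0) is taken as 0 (no lower limit for
        distances) and Z(n+1) as infinity."""
        z = sorted(math.hypot(x, y) for x, y in points)
        n = len(z)
        inner = 0.0 if k == 0 else z[k - 1]              # Z(k)
        outer = math.inf if k + s > n else z[k + s - 1]  # Z(k+s)
        return inner, outer

    sample = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(20)]
    print(ring_tolerance_region(sample, k=1, s=17))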
The same method can be applied to any kind of X-space whatever, by
letting φ be a real-valued function on this space. We shall require only that
φ(X) have a continuous distribution G (this avoids the difficulties of discrete-
ness). To carry the ideas a bit further, let Z(1), ..., Z(n) again be the order

statistics of the sample of values Z_i = φ(X_i). The Z(i) separate the Z-space
(the real line) into n + 1 intervals. Let R_1, ..., R_{n+1} be the corresponding
regions in X-space, that is, R_i is the X-region where φ(X) is between Z(i−1)
and Z(i). Specifically, R_i = {x: Z(i−1) < φ(x) ≤ Z(i)} where Z(0) = −∞ and
Z(n+1) = ∞ as before. The coverage C_i of R_i is the probability under the X
distribution in the region R_i, which is the probability under the Z distribution
in the interval between Z(i−1) and Z(i), which is G(Z(i)) − G(Z(i−1)). The
joint distribution of these coverages is therefore the same as in Sect. 9.3, and
in particular, the union of any s of the regions R_i is a tolerance region having
a coverage C whose distribution is given by (9.3) and (9.4). The indices
i_1, ..., i_s of the included regions are to be selected in advance, of course.
Regions R_i whose coverages have the same joint distribution as in Sect. 9.3
are called "statistically equivalent blocks," the equivalence being that any
permutation of the coverages has the same joint distribution as any other.
By generalizing the procedure for constructing statistically equivalent blocks,
we can obtain more general tolerance procedures. We shall proceed further
in this way.
Instead of using the same function throughout, we could use a sequence
of functions φ_1, φ_2, ..., φ_n. One way to do so is as follows. First, let R_1 be the
region where φ_1(X) is smaller than the smallest value φ_1(X_i) observed. Next,
remove the minimizing X_i from the sample and R_1 from the X-space, and
apply the same procedure to the remaining sample and the remaining portion
of X-space, using φ_2 in place of φ_1. And so on. At each stage, the remaining
X's are a sample from the original distribution except restricted to the re-
maining portion of the X-space; therefore, the conditional distribution of the
coverage of the next region to be removed given the coverages of the regions
already removed is the same no matter what function φ_j is used next and
hence is the same as when all functions φ_j are the same. Therefore, the regions
R_1, R_2, ..., R_{n+1} obtained are again statistically equivalent blocks.
The first step above may be thought of as using φ_1 to cut the X-space into
two regions, one consisting of one block and one consisting of n blocks
not yet subdivided. The second step then uses φ_2 to cut the latter region into
two regions, of 1 and n − 1 blocks, etc. Geometrically, the successive cuts
are along contours of φ_1, φ_2, ..., φ_n. Instead of cutting off one block at each
step, however, we could choose some arbitrary split. In this case, the first
step is to choose an integer r_1 between 1 and n, find the r_1th from the smallest
of the values φ_1(X_i), i = 1, 2, ..., n, and cut the X-space into two regions
according to whether φ_1(x) is smaller or larger than this r_1th value. We then
have one region containing r_1 − 1 X's and still to be subdivided into r_1
blocks, and a second region containing n − r_1 X's and still to be subdivided
into n − r_1 + 1 blocks; the remaining X, say X_[1], is the borderline value
through which the first cut passes. The second step is to cut one of these two
regions, using the function φ_2 and another arbitrary integer. After n steps, n
cuts have been made, all the X_i have been used, and there are n + 1 regions,
each fully subdivided, i.e., consisting of a single block. These n + 1 regions

are again statistically equivalent blocks, by a slight generalization of the
previous reasoning.
An even further generalization is to let the function and/or integer used
in the second step depend on the borderline value X_[1] through which the
first cut passes. Similarly, the function and integer used in any later step may
depend on the borderline X -values (and functions and integers) used in earlier
steps. They cannot, however, depend on the Xi other than the borderline
values in earlier steps. Since all the Xi must be examined to determine the
borderline values, it is essential for preventing any dependence on other than
borderline values that the procedure be fully specified in advance (or perhaps
that the borderline values be determined by a second party or a computer
without revealing any other values).
One other point should be clarified. "Statistical equivalence" of course
does not mean that a tolerance region could be formed from the s smallest
blocks obtained, for instance. The indices of the blocks to be included must
be specified in advance, and the indexing of the blocks must be carried out
in such a way that statistical equivalence holds. (Some relaxation of the
former is possible, but adjustment of the indexing process can accomplish
the same thing.) The requirement on the indexing is that an appropriate
number of indices must be assigned to each region at every step, and more
particularly that, before each step, from the indices already assigned to the
region about to be cut, an appropriate number must be selected and assigned
to each of the two regions which will result from the cut. What this amounts
to for the tolerance procedure is that, at every step, in each region, the number
of blocks which will ultimately belong to the tolerance region must be
specified, and more particularly that, before each step, the number of blocks
which will ultimately belong to the tolerance region must be specified for
each of the two regions which will result from the next cut. Again, dependence
is permitted on previous borderline X-values but not on other X_i.
We shall not attempt to state fully and formally the most general procedure
obtainable along the foregoing lines for either statistically equivalent blocks
or tolerance regions. Such a statement is bound to be very cumbersome and
therefore difficult to understand and use in checking the validity of a proposed
procedure. It is probably easier to understand the ideas and validate pro-
cedures directly on the basis of this understanding.
The generalizations above are applicable even in univariate situations.
For example, given a univariate sample X_1, ..., X_n, one might remove three
blocks, leaving a tolerance region of n − 2 blocks, as follows. First, make
cuts at the smallest and largest observations, X(1) and X(n), and remove the
two blocks (−∞, X(1)) and (X(n), ∞). (This is equivalent to using the cutting
function φ(x) = x, or any strictly monotone function thereof, for each of the
first two cuts.) Let

φ_3(x) = |x − (X(1) + X(n))/2|

and make the third cut at the largest value of φ_3(X(i)), i = 2, ..., n − 1. Then
the tolerance region will be the interval from

X(1) + X(n) − X(n−1)  to  X(n−1)      if X(2) − X(1) > X(n) − X(n−1),
X(2)  to  X(n) − X(2) + X(1)          otherwise

(Problem 76). The rationale for such a procedure might be that excluding
three blocks would more nearly permit the desired tolerance proportion and
confidence level than excluding two or four, and this is a more symmetric
procedure for excluding three blocks than using always X(1) to X(n−1) or
always X(2) to X(n).
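A direct Python sketch of this three-block procedure (illustrative; the sample
values are made up) returns the endpoints given by the rule just displayed.

    def three_block_tolerance_interval(xs):
        """Cuts at X(1), X(n), and then at the interior observation farthest
        from the midrange (X(1) + X(n))/2, leaving n - 2 blocks."""
        x = sorted(xs)
        x1, x2, xn1, xn = x[0], x[1], x[-2], x[-1]
        if x2 - x1 > xn - xn1:
            return (x1 + xn - xn1, xn1)   # third cut passes through X(n-1)
        return (x2, xn - x2 + x1)         # third cut passes through X(2)

    print(three_block_tolerance_interval([3.1, 0.2, 5.7, 1.4, 4.8, 2.9, 6.0, 2.2]))
    # (0.5, 5.7) for this sample, since X(2) - X(1) > X(n) - X(n-1)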
More elaborate univariate procedures, including this as a special case, are
discussed by Walsh (1962b). The possibilities of generalization seem to have
been opened up in Scheffe and Tukey (1945) and Tukey (1947). Further
discussion and references are given by Fraser (1957) and Guttman (1970).

PROBLEMS

1. Show that, if F(x) = p for a < x < b, then a and b are both pth quantiles of F.
2. Show that ξ is a pth quantile of F if and only if the point (ξ, p) lies on the graph of
F with any vertical jumps filled in.
3. Show that ξ is a pth quantile of a distribution if and only if, when ξ is included, the
left tail probability is at least p and the right tail probability is at least 1 − p.
4. (a) Sketch seven c.d.f.'s to exhibit the seven possible combinations of cases (a)-(c)
of Sect. 2.
(b) Which combinations are possible for
(i) discrete distributions?
(ii) distributions with densities?
5. What is the relation between the quantiles and the inverse function of a c.d.f.?
6. An estimator is called median unbiased for a parameter if its median is that para-
meter. Show that, for odd sample sizes, the sample median is a median unbiased
estimator of the population median.
7. (a) If ξ_p is a pth quantile of a distribution on a finite set, show that either p or ξ_p
is not unique. Relate this to cases (a)-(c) of Sect. 2.
(b) For what countably infinite sets does (a) hold?

8. A constant problem in manufacturing plants is machine breakdowns. Time
consumed while the machine is being repaired is called "down time." Both the
expense of repair and amount of down time need to be kept small for efficient and
profitable production. The managers of a plant are considering replacing the type
of machine currently in use. The decision should depend on the costs of purchase
and changeover, in addition to a comparison of down time. To get data as input
for this decision, twenty machines of the new type are leased and observed for a
fixed period of time. Only one machine had a breakdown. The probability of a
breakdown is known to be 0.10 for the machine previously used. Does the replace-
ment machine seem to have a smaller probability of breakdown? Find the P-value.

9. It is frequently claimed that working together on a common project makes people
like each other more. A sociologist ran an experiment to test the null hypothesis that
no systematic change in people's friendliness occurs through joint participation,
against the alternative that people become friendlier. Twenty-five pairs of indivi-
duals were selected at random, each pair was observed together in several situations
for one week, and notes made on their relationship. After each pair had worked
together on a single project for one week, their relationship was observed again.
Fifteen pairs were noted as having a friendlier relationship after the project.
Perform a statistical test, and comment on its appropriateness and limitations.

*10. Let X_1, X_2, ..., X_n be independent, identically distributed m-vectors and let
p = P(X_j ∈ A) for A a given set in m-space. Find a uniformly most powerful test
of the null hypothesis H_0: p ≤ p_0 against the alternative H_1: p > p_0 [Lehmann,
1959, p. 93].

*11. Prove that any two-tailed sign test is admissible if X_1, ..., X_n are independent and
identically distributed with P(X_j = ξ_0) = 0, where ξ_0 is the hypothesized median
value.

*12. Prove the optimum properties of the sign test for H_0: p = p_0 given in Sect. 3.2 for
the case where the observations are not necessarily identically distributed but are
independent with P(X_j < ξ_0) = P(X_j ≤ ξ_0) = p for all j.

*13. Prove that S, the number of observations below ξ_0, is a sufficient statistic for p
if X_1, ..., X_n are independent and identically distributed with distribution F
belonging to the family defined in Sect. 3.3, paragraph 2.

*14. Show that, if X_1, X_2, ..., X_n are independent and identically distributed with
P(X_j < ξ_0) = P(X_j ≤ ξ_0) = p, then the unbiased, two-tailed sign test of the null
hypothesis H_0: p = p_0 is uniformly most powerful unbiased against H_1: p ≠ p_0.

*15. Show that, in Problem 14, the two-tailed, symmetric, level α sign test is most
powerful among all symmetric, level α tests.

*16. Prove Theorem 3.1, which gives conditions under which a test is most powerful.

17. An automobile manufacturer wishes to design a certain new model such that the
front-seat headroom is sufficient for all but the tallest 5% of male drivers. A random
sample of 100 male drivers is taken. The heights of the 9 tallest are as follows:

70.1, 72.3, 71.9, 70.5, 73.4, 76.1, 74.5, 70.9, 75.8.

(a) Find a 90 % two-sided confidence interval for the 95th percent point of the
population of male drivers.
(b) Former studies by the Federal government have shown that the 95th percentile
point for height of U.S. males is 70.2 inches. Does this result appear to be valid
now and for the population of male drivers?

18. A sample of 100 names was drawn from the registered voters in Appaloosa County
and sent questionnaires regarding a proposed taxation bill. Of the 75 usable
returns, 50 were in favor of the bill. Find a 95 % confidence interval for the true
proportion of registered voters in favor of the bill. What assumption are you
making about the unusable returns?

19. Suppose that a quantile of order p is not unique for fixed p, that is, ξ_p is any value
in the closed interval [ξ_p′, ξ_p″] for some ξ_p′ < ξ_p″. Show that if lower and upper
confidence bounds, say L and U, are determined by the usual sign test procedure,
each at level 1 − α, then
    P(L ≤ ξ_p′) ≥ 1 − α   and   P(U ≥ ξ_p″) ≥ 1 − α.

20. (a) Suppose that L is a lower confidence bound for a pth quantile ξ_p constructed
    by the usual sign test procedure at exact level 1 − α for a continuous popula-
    tion. Show that if the population is discontinuous, the lower confidence
    bound has at least the indicated probability of falling at or below ξ_p, and at
    most the indicated probability of falling strictly below ξ_p, that is,

    P(L ≤ ξ_p) ≥ 1 − α ≥ P(L < ξ_p).

    (b) Show that a corresponding statement holds for the upper confidence bound.
    (c) Show that the two-sided confidence interval which includes its endpoints has
    at least the indicated probability of covering ξ_p (and hence at most the indi-
    cated error rate), while the interval excluding its endpoints has at most the
    indicated probability of covering ξ_p.

21. Show that
    (a) If X has a continuous, strictly increasing c.d.f. F, then Y = F(X) has a uniform
    distribution over (0, 1). (This is the fundamental property of the "probability
    integral transformation" F(X), named thus presumably because F is the
    integral of the density of X when it has one.)
    (b) Conversely, if Y has a uniform distribution over (0, 1) and F is a continuous,
    strictly increasing c.d.f., then X = F^(−1)(Y) has c.d.f. F.
    (c) In (b), if F is any c.d.f. whatever and X is any quantile of order Y in F, then X
    has c.d.f. F.
22. Suppose that four independent, dichotomous trials are observed, with true
    probability p_j of success on the jth trial, j = 1, 2, 3, 4. Let p̄ = Σ p_j/4. If the upper
    confidence limit for p̄ is taken to be 0.44, 0.68, 0.85, 0.97, and 1.00 respectively
    when there are 0, 1, 2, 3, and 4 successes, graph the true confidence level as a function
    of p̄ for each of the following situations:
    (a) p_1 = p_2 = p_3 = p_4 = p̄
    (b) p_1 = p_2 = p̄ − 0.1, p_3 = p_4 = p̄ + 0.1
    (c) p_1 = 0, p_2 = p_3 = p_4 = 4p̄/3
    (d) p_1 = 1, p_2 = p_3 = p_4 = (4p̄ − 1)/3.
    Be sure to study what happens in the neighborhood of p̄ = 0.44, 0.68, 0.85, 0.97;
    a few other values of p̄ will suffice. (Notice that the procedure is conservative.
    See Hoeffding [1956].)

23. Consider "fixed effects" models with p_j = P(X_j < ξ_0) arbitrary except that
    p̄ = Σ_j p_j/n is given. Show that, for sign tests with critical values s_l and s_u:
    (a) If p̄ ≥ (s_l + 1)/n and/or p̄ ≤ (s_u − 1)/n as appropriate, then there exist
    values of p_j for which the probability of "acceptance" is 1 (and hence the
    power is 0 if p̄ ≠ p_0).
    (b) If p̄ ≤ s_l/n or p̄ ≥ s_u/n, then there exist values of p_j for which the probability
    of rejection is 1.

24. Let X(r) denote the rth from the smallest in a random sample of size 5 from any
    continuous population with ξ_p the quantile of order p. Evaluate the following
    probabilities:
    (a) P(X(1) < ξ_0.50 < X(5))
    (b) P(X(1) < ξ_0.25 < X(3))
    (c) P(X(4) < ξ_0.80 < X(5)).

25. If X(1) and X(n) are respectively the smallest and largest observations in a random
    sample of size n from any continuous distribution F with median ξ_0.50, find the
    smallest value of n such that
    (a) P(X(1) < ξ_0.50 < X(n)) ≥ 0.95
    (b) P[F(X(n)) − F(X(1)) ≥ 0.50] ≥ 0.95.

26. Let V denote the proportion of the population lying between the smallest and
    largest observations in a random sample of size n from any continuous population.
    Find the mean and variance of V.

27. In a random sample of size n from any continuous population F, the interval
    (X(r), X(n−r+1)) for any r < n/2 gives a level 1 − α confidence interval for the
    median of F. Show that α can be written as

    α = (0.5)^(n−1) Σ_{k=0}^{r−1} C(n, k) = 2n C(n−1, r−1) ∫_0^0.50 x^(n−r) (1 − x)^(r−1) dx.

28. Show that the exact confidence level of a confidence interval for the median
    with endpoints the second smallest and second largest observation is equal to
    1 − (n + 1)/2^(n−1)
    for a random sample of n observations from any continuous population.


29. Show that if X(1) < ... < X(n) are the order statistics of an independent random
    sample from a continuous distribution with density f, then the joint density of the
    order statistics is

    n! Π_{i=1}^{n} f(x_i)   for x_1 < x_2 < ... < x_n.

    For example, the density of the normal order statistics is

    n! (2πσ²)^(−n/2) exp[−Σ_{i=1}^{n} (x_i − μ)²/(2σ²)]   for x_1 < x_2 < ... < x_n.

30. Let X(r) be the rth order statistic of a random sample of size n from a population
    with continuous c.d.f. F.
    (a) Differentiate (4.1) to show that the marginal density of X(r) is

    g(x) = r C(n, r) [F(x)]^(r−1) [1 − F(x)]^(n−r) f(x).

    (b) Show that the c.d.f. of the density in (a) can be written as

    G(t) = P(X(r) ≤ t) = ∫_0^F(t) [B(r, n − r + 1)]^(−1) u^(r−1) (1 − u)^(n−r) du,

    and hence this binomial sum is equivalent to the incomplete beta c.d.f. above.
    (c) Integrate (a) by parts repeatedly to obtain the binomial form in (4.1).

31. By considering P(X(r) > t/n) in the binomial form given in (4.1), find the asymp-
    totic distribution of X(r) for r fixed and n → ∞ if
    (a) F is the uniform distribution on (0, 1).
    (b) F is an arbitrary continuous c.d.f.
32. Let X(n) denote the largest value in a random sample of size n from the population
    with density function f.
    (a) Show that lim_{n→∞} P(n^(−1) X(n) ≤ x) = exp(−α/(πx)) if f(x) = α/[π(α² + x²)]
    (Cauchy).
    (b) Show that lim_{n→∞} P(n^(−2) X(n) ≤ x) = exp(−α√(2/(πx))) if f(x) = (α/√(2π)) x^(−3/2)
    exp(−α²/(2x)) for x ≥ 0.
33. Show that the joint density of two order statistics X(r), X(v), 1 ≤ r < v ≤ n, of an
    independent random sample from a population with continuous c.d.f. F, is

    g(x, y) = n(n − 1) C(n − 2; r − 1, v − r − 1, n − v) [F(x)]^(r−1) [F(y) − F(x)]^(v−r−1)
              × [1 − F(y)]^(n−v) f(x) f(y)   for x < y,

    where C(m; r, s, t) = m!/(r! s! t!) for r + s + t = m.

34. For any two order statistics X(r) and X(v) of an independent random sample from
    a population with continuous c.d.f. F, where r < v, show that the event X(r) < ξ_p
    occurs if and only if either X(r) < ξ_p < X(v) or ξ_p ≥ X(v). Since these latter two
    events are mutually exclusive, it follows that

    P(X(r) < ξ_p < X(v)) = P(X(r) < ξ_p) − P(X(v) ≤ ξ_p).

    This result can be expressed in binomial form as in (4.3), or, by Problem 30(b), it
    can be written as the difference

    ∫_0^p [B(r, n − r + 1)]^(−1) u^(r−1) (1 − u)^(n−r) du
    − ∫_0^p [B(v, n − v + 1)]^(−1) u^(v−1) (1 − u)^(n−v) du.

35. Let X(1) < ... < X(n) be order statistics of a random sample of size n from the
    exponential density f(x) = e^(−x), x ≥ 0.
    (a) Show that X(r) and X(v) − X(r) are independent for any r < v.
    (b) Find the distribution of X(r+1) − X(r).
    (c) Interpret the significance of these results if the sample arose from a life test of n
    items with exponential lifetimes.
36. Find the density and c.d.f. of the range, X(n) − X(1), of a random sample of size n
    from any continuous population.

37. Suppose we want to use a random sample of size n to find a level 0.95 confidence
    interval for θ in the density f(x) = exp[−(x − θ)] for x > θ. Since the smallest
    observation X(1) is a sufficient statistic for θ (and also its maximum likelihood
    estimator), some functions of X(1) would be a natural choice for the confidence
    bounds. If X(1) is the upper confidence bound, find that lower confidence bound
    g(X(1)) which gives a two-sided level of 0.95.

38. Suppose we have a normal population with unknown mean and median ξ and
    known variance σ² and we require a test of the null hypothesis H_0: ξ = ξ_0 against
    the simple alternative ξ = ξ_1, where ξ_1 > ξ_0, such that the level is α and the power
    is 1 − β. Let α and β be fixed while (ξ_1 − ξ_0)/σ = δ approaches 0.
    We consider two different tests for the situation described above. Test A is the
    appropriate normal theory test for a sample of size n_A, that is, the test based on
    Z = √n_A (X̄ − ξ_0)/σ with a right-tail critical value z_α from Table A. Test B is the
    sign test for a sample of size n_B, that is, the test based on S, the number of observa-
    tions smaller than ξ_0.
    (a) Obtain a general expression for the sample size n_A required by the normal
    theory test.
    (b) Obtain an approximate expression for the sample size n_B required by the sign
    test, by approximating the binomial distribution by the normal distribution.
    (c) The limit of the ratio n_A/n_B as δ → 0 is known as the asymptotic efficiency of
    the sign test relative to the optimum normal theory test. Using (a) and (b),
    show that the asymptotic relative efficiency equals 2/π. This example and
    others will be discussed at length in Chap. 8.

39. (a) Show that the value of i/n such that X(i) is a lower confidence bound for the
    median at conservative level 0.95 is 0.28 when n = 9, and 0.32 when n = 17.
    What is the true level in each case?
    (b) Find the true level of X(i+1) as a lower confidence bound for the median
    when i/n = 0.28, n = 9 and when i/n = 0.32, n = 17.
    (c) Since the true levels in (a) and (b) are approximately equidistant from 0.95
    for each n, one might consider using (X(i) + X(i+1))/2 as a lower confidence
    bound for the median. If the population is symmetric, what is the true level of
    this bound when n = 9, and when n = 17?

40. Show that the following pairs of hypotheses are equivalent:
    (a) max ξ_{p_0} < ξ_0 and p_< > p_0
    (b) min ξ_{p_0} ≤ ξ_0 and p_≤ ≥ p_0
    (c) min ξ_{p_0} > ξ_0 and p_≤ < p_0
    (d) max ξ_{p_0} ≥ ξ_0 and p_< ≤ p_0.
    Here max ξ_{p_0} and min ξ_{p_0} denote the maximum and minimum possible values
    of ξ_{p_0}, as required if ξ_{p_0} is not unique.

41. Use the results of Problem 40 to show that ξ_0 is a p_0-point if and only if
    p_< ≤ p_0 ≤ p_≤, and thus that the two-sided test that rejects if S_≤ ≤ s_l and also if
    S_< ≥ s_u is appropriate for a two-sided test of the null hypothesis that ξ_0 is a p_0-
    point when one or more observations is equal to ξ_0.

42. (a) Show that

    E(S_< − S_>) = n(p_< − p_>)

    var(S_< − S_>) = n[p_< + p_> − (p_< − p_>)²].

    (b) Show that if p_< + p_> = 1 then S_< − S_> is even with probability 1 for n even,
    odd for n odd.

    (c) Show that S_< − S_> is asymptotically normal with the mean and variance
    given in (a), even if the parameters p_< and p_> depend on n, provided that this
    variance approaches infinity.
    (d) Derive approximate tests and confidence bounds for p_< − p_> and show
    that their levels are asymptotically valid.

43. (a) Given any c.d.f. G and any positive p_<, p_> with p_< + p_> ≤ 1, show that
    there is exactly one c.d.f. F having the same conditional distributions as G
    on each side of ξ_0, P_F(X < ξ_0) = p_<, and P_F(X > ξ_0) = p_>, and express F
    algebraically in terms of G, p_<, and p_>.
    (b) Show that the family of such distributions F includes G and includes a dis-
    tribution with p_< = p_> for each p_<, 0 < p_< < ½.
    (c) Show that the statistics S and N are jointly sufficient for this family, where S
    and N are the numbers of observations in a sample which are respectively < ξ_0
    and ≤ ξ_0.
    (d) Show that, for the subfamily of such distributions F with p_< = p_>, the statistic
    N is sufficient.

44. (a) Show that a conditional test at conditional level α is an unconditional test at
    level α.
    (b) Show that a conditionally unbiased test is unconditionally unbiased.
    (c) Show that if all unbiased tests against a certain alternative are conditional
    and if a certain conditional test has uniformly greatest conditional power,
    then this test is uniformly most powerful unbiased against this alternative.
*45. Prove Theorem 6.1 relating unbiased and conditional tests.
46. In the situation and notation of Sect. 6, for n = 4, let φ(S_<, N) be defined by the
    accompanying table, with φ(S_<, N) = 0 elsewhere.

    N      0      1       2       3       4
    S_<    0      0, 1    0, 2    1, 2    0, 4
    φ      1/8    1/8     1/4     1/6     1

    (a) Show that the test φ has conditional level 1/8 for each N.
    (b) Show that φ is biased conditional on N = 3.
    (c) Show that φ has power

    (1/8)(P_0 + P_1) + (1/4)P_2[p² + (1 − p)²] + (1/2)P_3 p(1 − p) + P_4[p⁴ + (1 − p)⁴]

    where P_m = P(N = m) = C(4, m) r^m (1 − r)^(4−m) with r = p_< + p_> and p = p_</r.
    *(d) Show that the conditional power for N = 4 exceeds that for N = 2, i.e.,
    p⁴ + (1 − p)⁴ > (1/4)[p² + (1 − p)²] for p ≠ 1/2.
    (e) Show that the conditional power for N = 2 and that for N = 3 average to
    1/8, i.e., (1/8)[p² + (1 − p)²] + (1/4)p(1 − p) = 1/8.
    *(f) Show that P_3 ≤ P_2 + P_4.
    *(g) Use these facts to show that φ is unconditionally unbiased.

*47. Fill in the details of the proof that the equal-tailed conditional sign test is uniformly
    most powerful unbiased when ξ_0 has positive probability by showing the results
    stated in (6.11)-(6.13).

48. Suppose that a pair of control and treatment measurements (V, W) would be
    permutable if the treatment had no effect, while the effect of the treatment is to
    add a constant amount μ to W. Show that the median of the difference X = W − V
    is μ.

49. Suppose that a confidence interval for the median is obtained by applying the
    methods of Section 4 to the differences X_i = W_i − V_i of independent, identically
    distributed pairs (V_i, W_i). Show that the confidence procedure is valid for the "shift"
    θ which makes P(W_i − θ < V_i) and P(W_i − θ > V_i) both no larger than 0.5. (In
    the continuous case, W_i − θ is equally likely to be less than or greater than V_i.
    Nevertheless, θ need not equal the difference between the medians of W_i and V_i;
    see the following problems and Sect. 2 of Chapter 3.)
50. This problem and the next one show that the population medians of the treatment-
    control differences in matched-pair experiments may be all positive or all negative
    even though the treatment has no effect on the mean or median for any unit. If
    the dispersion is larger for larger-valued units, and if the treatment accentuates
    this, then the treatment-control differences typically have negative medians al-
    though they may have positive medians. The following problem gives a similar
    result for skewness.
    Suppose that each unit possesses a "unit effect" u and would yield the observed
    value U = u if untreated but T = u + τ(u)Z if treated, where τ(u)Z is a "random
    error," independent from unit to unit, with positive scale factor τ(u). Suppose that a
    pair of units is given whose unit effects are u′ and u″, with u′ < u″, say; then one
    of the two units is chosen at random for treatment. Let X be the treatment-control
    difference observed, and let δ = u″ − u′, τ′ = τ(u′), and τ″ = τ(u″).
    (a) Show that P(X ≤ x) = ½P(δ + τ″Z ≤ x) + ½P(−δ + τ′Z ≤ x).
    (b) Suppose that P(Z = −1) = P(Z = 1) = 0.5. Show that the possible medians
    of X are all positive if τ″ < δ < τ′, all negative if τ′ < δ < τ″, and include 0
    otherwise. [Hint: P(X = −δ − τ′) = P(X = −δ + τ′) = P(X = δ − τ″) =
    P(X = δ + τ″) = 0.25.]
    (c) Suppose that τ(u) is an increasing function of u and that Z is symmetrically
    distributed about zero, that is, P(Z ≥ z) = P(Z ≤ −z) for all z. Show that
    P(X < 0) ≥ 0.5 with equality holding if and only if P(δ/τ″ ≤ Z ≤ δ/τ′) = 0.
    (In the case of inequality, the median of X must be negative.)
51. Suppose that the situation of the previous problem holds except that U is dis-
    tributed as u + ν(u)Z with ν(u) > 0, and the pairs (U, T) are independent from
    unit to unit. (For any one unit, only the marginal distributions of U and T matter.
    Beyond this their joint distribution is irrelevant.) Let ν′ = ν(u′) and ν″ = ν(u″).
    (a) Show that P(X ≤ x) = ½P(τ″Z″ − ν′Z′ ≤ x − δ) + ½P(τ′Z′ − ν″Z″ ≤ x +
    δ) where Z′ and Z″ are independently distributed as Z.
    (b) Show that P(X < 0) > 0.5 if Z is normal with mean 0 and the treatment
    variance minus the control variance, τ²(u) − ν²(u), is an increasing function
    of the unit effect u.
*(c) Show that if Z is uniformly dlstnbuted over the interval [ - R R], ," - v" ~
" - v', and (v"," - v'r')(i," - 1"1''') > 0, then P(X < 0) ~ 0.5 wIth equality
if and only If R :::;; 8/(1" + r").
    (d) Show that τ and ν satisfy the conditions of (b) and (c) if τ(u) − ν(u) is a positive,
    increasing function of u and ν(u) is nondecreasing.

    (e) Show that P(X ≤ 0) = 4/9 if P(Z = z) = 1/3 for z = −1, 0, 1 and ν′ < δ < ν″ <
    τ′ + δ and δ < τ′ < τ″ < ν′ + δ.
    If ν(u) and τ(u) − ν(u) are increasing functions of u, parts (b)-(d) illustrate that
    the median of X will typically be negative, although it may be positive, even for
    symmetric Z, since these conditions do not preclude the conditions of part (e).
52. In the situation of the previous two problems, replace the assumptions about
    U and T by E(U) = E(T) = u (so the treatment has no effect on the mean), var(U) =
    ν²(u) and var(T) = τ²(u) are finite, and E(U − u)³ = E(T − u)³ = 0, with (U, T)
    still independent from unit to unit. Show that E(X) = 0 and E(X³) = 1.5δ(τ″² −
    ν″² − τ′² + ν′²). Therefore the distribution of X is skewed to the right if τ²(u) −
    ν²(u) is an increasing function of u, even though U and T have no skewness.

53. In the paired situation of Sect. 8, define a difference score for each pair as the
    score on II minus the score on I. Show that the probability of a positive difference
    score minus the probability of a negative difference score equals p_II − p_I, where
    p_I and p_II denote the probabilities of a score of 1 on I and II respectively.

54. Suppose that one male and one female chick are selected at random from each of
10 litters and inoculated with an organism which is thought to produce an equal
chance of life (1) or death (0) for every chick. The death occurs within 24 hours
if at all, and the organism has no effect on life after the first 24 hours. Test the data
below to investigate whether the sexes differ in their response to the organism.

Litter Male Response Female Response

1 1 0
2 0 0
3 1 1
4 1 0
5 0 1
6 0 0
7 0 0
8 0 1
9 0
10 0

55. Prior to a nationally televised series of debates between the presidential candidates
of the two major parties, a random sample of 100 persons were asked their pre-
ference between these candidates. Sixty-three persons favored the Democrat at
this time. After the debate the same 100 people were asked their preference agam,
and now seventy-two favored the Democratic candidate. Of these, 12 had pre-
viously stated a preference for the Republican candidate. Test the null hypothesis
that the voting preferences were not significantly changed after the debate. Can
you say anything about the effect of the debate?
56. In a study to determine whether constant exposure of children to violence on TV
affects their tendency toward violent behavior and possibly crime, disturbance,
etc., a group of 100 matched pairs of children were randomly selected. The pairs

were formed in the population by matching the children as well as possible as
regards home environment, genetic factors, intelligence, parental attitudes, etc.,
so that differences in other factors influencing aggressive behavior could be
minimized. In each pair, one child was randomly selected to view routinely the
most violent shows, while the other was permitted to watch only "acceptably
nonviolent" shows. Psychologists rated each child's tendency toward aggression,
both before and after the experiment, and noted whether the children exhibited
more (1) or the same (or less) (0) tendency to aggression. The numbers of children
in these rating groups are shown below. Analyze the data to test
(a) The hypothesis that the proportion of children exhibiting more tendency to
aggression is the same regardless of what kind of TV shows they watch.
(b) The hypothesis that the proportion of children "switching" their aggression
pattern in each direction is the same. Find a confidence interval at level 0.90 for
the difference of these proportions. Comment on the meaning of this dif-
ference.

Score of Children Exposed to

Violence    Nonviolence    Number of Pairs

1           1              18
1           0              43
0           1               8
0           0              31

57. Suppose that an achievement test is divided into k parts, all covering the same type
of achievement but using different methods of investigation. All parts of the test
are given to each of r subjects and each is given a score of pass or fail on each part
of the test. The data might be recorded as follows, where 1 represents pass and 0
represents fail.

Subject    Score on part 1    Score on part 2    ...    Score on part k    R

1               0                  1             ...                       R_1
2               1                  0             ...                       R_2
...
r               0                  0             ...                       R_r

               C_1                C_2            ...        C_k            N

The symbols R_1, ..., R_r and C_1, ..., C_k represent the row and column totals
respectively, so that R_i is the number of parts passed by subject i and C_j is the
number of subjects who passed part j. N is the total number of parts passed by all
subjects.
Let p_ij denote the probability that the score of the ith subject on the jth part of
the test will be a 1 (pass). Suppose we are interested in the null hypothesis

H_0: p_i1 = p_i2 = ... = p_ik   for i = 1, ..., r.



This says that the probability of a pass is the same on all parts of the test for each
subject, or that the parts are of equal difficulty. The Cochran Q test statistic is
defined as
Q = k(k − 1) Σ_{j=1}^{k} (C_j − N/k)² / Σ_{i=1}^{r} R_i(k − R_i).

(a) Give a rationale for the use of the quantity Q to test the hypothesis of interest.
What is the appropriate tail for rejection?
(b) While the exact null distribution of Q conditional on R_1, ..., R_r can be gen-
erated by enumeration, it is time-consuming and difficult to tabulate. Hence a
large sample approximation is usually used instead unless r is quite small,
specifically the chi square distribution with (k − 1) degrees of freedom. Give a
justification of this approximation along the following lines. Within each
column, the observations are Bernoulli trials. Hence for each j, C_j follows the
binomial distribution with mean Σ_{i=1}^{r} p_ij and variance Σ_{i=1}^{r} p_ij(1 − p_ij).
Under the null hypothesis, this mean and variance can be estimated from the
sample data as N/k and Σ_{i=1}^{r} (R_i/k)[1 − (R_i/k)], but this latter estimate is
improved by multiplying by the correction factor k/(k − 1). Standardized
variables are then squared and summed to obtain Q. While the C_j are not
independent, as r increases the C_j approach independence. One degree of
freedom is lost for the estimation procedure.

tC; - I
(c) Show that the test statistic Q above can be written in the equivalent form

Q = (k - 1)(k N 2) (kN - ~ R~ )
which is easier for calculation.
(d) Show that when k = 2, the test statistic Q reduces to
Q = (C − B)²/(C + B)
where C and B are defined as in Table 8.1, that is, C is the number of (0, 1)
pairs and B is the number of (1,0) pairs. Hence the test statistic Q when k = 2
is equivalent to the test statistic for equality of paired proportions presented
in Sect. 8.1.
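The computational form in part (c) is easy to program. The following Python
sketch (illustrative; the 0/1 data are hypothetical) computes Cochran's Q from
a subjects-by-parts array of pass/fail scores.

    def cochran_q(scores):
        """scores: one row per subject, each row a list of k 0/1 scores.
        Computes Q = (k-1)(k*sum C_j^2 - N^2) / (k*N - sum R_i^2)."""
        k = len(scores[0])
        col_totals = [sum(row[j] for row in scores) for j in range(k)]
        row_totals = [sum(row) for row in scores]
        n_total = sum(row_totals)                      # N
        numerator = (k - 1) * (k * sum(c * c for c in col_totals) - n_total ** 2)
        denominator = k * n_total - sum(r * r for r in row_totals)
        return numerator / denominator

    example = [[1, 0, 1], [1, 1, 1], [0, 0, 1], [1, 0, 0], [1, 1, 0]]
    print(cochran_q(example))  # 1.5 for these hypothetical scores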
58. Forty-six subjects were each given drugs A, B, and C and observed as having a
favorable (1) or unfavorable (0) response to each. The results reported in Grizzle
et al. [1969, p. 494] are shown in the table below.

Response to Drug    Number of
A  B  C             Subjects

1  1  1              6
1  1  0             16
1  0  1              2
0  1  1              2
1  0  0              4
0  1  0              4
0  0  1              6
0  0  0              6

Test the null hypothesis that the drugs are equally effective.

59. Four different surgical procedures for a duodenal ulcer are A (drainage and
vagotomy), B (25% resection and vagotomy), C (50% resection and vagotomy)
and D (75 % resection). Each procedure was used for a fixed period of time in
each of 15 different hospitals, and an overall clinical evaluation made of the
severity of the "dumping syndrome," an undesirable aftereffect of surgery for
duodenal ulcer. The overall evaluation was made as simply" not severe" (0) or
"present to at least some degree" (1). Analyze the results below for any significant
difference between aftereffects of the four surgical procedures.

Surgical Procedure
Hospitals A B C D

1,7,8, 11 1 0 1 0
2,3,13 0 0 1 0
4,10 1 1 0 1
5, 12 1 1 0 0
6 0 1 0 1
9, 14, 15 1 0 0 0

60. Suppose that for a group of n matched pairs of individuals, one member of each
pair is selected to receive a treatment for a certain period while the other serves as
a control (is untreated or given a placebo). Each individual is measured both
before and after the treatment period so that there are a total of 4n observations.
Indicate how this can be reduced to a matched pair situation while making use of
both before-after and treatment-control information. If all 4n observations mea-
sure only the presence or absence of response, can the procedures of Sect. 8.1 be
applied? (Hint: How many different response categories are there for the differ-
ences?)
*61. Suppose that half of a given group of patients is selected at random to receive
Drug I at a certain time and Drug I I at a later time, while the remaining patients
receive Drug I I first and Drug I second. On each occasion, the characteristic
measured is dichtomous. Suppose further that the two drugs are known to have
exactly the same effect on all patients but there is a time effect. Show that a test for
equality of proportions using matched observations ignoring the time effect is
conservative. (Of course one would expect to obtain better power from a suitable
test which takes into account the time effect. See also the end of Sect. 3.1.)

62. Suppose that the randomization in Problem 61 is carried out within pairs of
patients rather than over the whole group of patients. How does the situation then
relate to that of Problem 60?

63. Prove the test properties stated in Sect. 8.5.

64. In the paired dichotomy situation and notation of Sect. 8.6, derive tests and con-
fidence procedures for
(a) the parameter λ = p01/p10,
(b) the parameter}, = p,,/p,.

65. In the paired dichotomy situation and notation of Sect. 8.6, show that inde-
pendence of score on II and sameness is equivalent to p11 p10 = p01 p00.
66. In the paired dichotomy situation and notation of Sect. 8.6, show that if sameness
is independent of score on I and of score on II, then either P(same) = 0 or
P(different) = 0 or p11 = p00 and p10 = p01.
67. Show that, under the model of Sect. 8.7 for paired dIchotomous observations,
a one-sided test of P = 0.5 as described in Sect. 8.1 (or 0 = 00 as described in
Sect. 8.6) is uniformly most powerful against a one-sided alternative. Show also
that the related unbiased tests are uniformly most powerful unbiased against the
alternative p ≠ p_0.
68. (a) Let c be the coverage of a specified (nonrandom) region R. Suggest non-
parametric tests and confidence procedures for c.
(b) What optimum properties would you expect your tests in (a) to have?
(c) Show that the optimum properties you stated in (b) hold.
69. Suppose a tolerance region with tolerance proportion 0.90 and confidence level
0.95 is set up for a production process. Thereafter, pairs of items are observed,
and trouble-shooting is undertaken whenever both fall outside the tolerance
region. What can you say about the amount of trouble-shooting required when the
process is in control?
70. Let C_1, ..., C_{n+1} be the coverages of a sample of n from a continuous c.d.f. F.
    Let U_s = Σ_{i=1}^{s} C_i, s = 1, ..., n.
    (a) What is the joint density of U_1, ..., U_n? Of C_1, ..., C_n?
    (b) What is the joint density of U_i, U_j for i ≠ j? Of C_i, C_j for i ≠ j?
71. Let C_1, ..., C_{n+1} be the coverages of a sample of n from a continuous c.d.f. F,
    and U_s = Σ_{i=1}^{s} C_i for s = 1, ..., n. Show that the following properties hold.
    (a) The joint distribution of C_1, ..., C_{n+1} does not depend on F.
    (b) The random variables U_s are the order statistics of a sample of n from the
    uniform distribution on (0, 1).
    (c) Given U_1, ..., U_{s−1}, U_{s+1}, ..., U_n, the conditional distribution of U_s is
    uniform on the interval (U_{s−1}, U_{s+1}).
    (d) Any permutation of C_1, ..., C_{n+1} has the same joint distribution as C_1, ...,
    C_{n+1}.
    (e) E(C_i) = 1/(n + 1) for all i.
    (f) The correlation of C_i and C_j is −1/n for all i, j with i ≠ j. (Hint: This can be
    shown without integration.)
    (g) The sum of any s coverages is distributed like U_s (which has the beta distribu-
    tion (9.3) by Problem 30).
    (h) If m intervals (X(i−1), X(i)) are excluded, the probability that the coverage of
    the remaining region is at least c is the probability of m or more successes
    under the binomial distribution with parameters n and p = 1 − c.
    (i) The conditional distribution of U_1, ..., U_s given U_{s+1} is that of the order
    statistics of a sample of s from the uniform distribution on (0, U_{s+1}).
    (j) The random variables C_i′ = C_i/U_{s+1} = C_i/(C_1 + ... + C_{s+1}), i = 1, ..., s + 1,
    are jointly distributed like the coverages of a sample of s from a continuous
    distribution.
    (k) Given U_3, U_7, the conditional distribution of U_1, U_2, U_4, U_5, U_6, U_8, ..., U_n
    (n > 7) is that of the order statistics of three independent samples, one of two
    observations from the uniform distribution on (0, U_3), one of three observations
    from the uniform distribution on (U_3, U_7), and one of n − 7 observations

from the uniform distribution on (U 7' 1). (Note that this generalizes to any
set of given U's.)
(I) The coverages C, have the same joint distribution as the random variables
Z,/(ZI + ... + Z"+I) where the Z, are independently distributed with
density e- z , z ~ O.
72. In a sample of size n from a continuous distribution, let I_1 be the interval between
    the smallest and largest observation and I_2 be the interval between −∞ and the
    next-to-largest observation. Show that
    (a) I_1 and I_2 are both tolerance regions with tolerance proportion 0.5 and con-
    fidence level 1 − (n + 1)/2^n.
    (b) I_1 is a confidence interval for the median with confidence level 1 − 1/2^(n−1).
    (c) I_2 is a confidence interval for the median with confidence level 1 − (n + 1)/2^n.
    (d) I_1 and I_2 are tolerance regions with tolerance proportion 0.75 and confidence
    level 1 − (n + 3)3^(n−1)/4^n.
    (e) I_1 is a confidence interval for the upper quartile with confidence level 1 −
    (1 + 3^n)/4^n.
    (f) I_2 is a confidence interval for the upper quartile with confidence level
    1 − (n + 3)3^(n−1)/4^n.
73. (a) Show that Murphy's graphs (Fig. 9.1) give the upper confidence limit for a
    binomial parameter p, as a function of n, for a fixed number of successes m
    and confidence level 1 − α.
    (b) To what extent do you agree with the accompanying estimates of ease and
    accuracy of using various types of tables and graphs for various purposes?
    Make reasonable assumptions about the grids of values employed, etc. The
    notation is that 1 − α is the probability of r or more successes in n binomial
    trials with parameter p.

Tables Graphs
Direct binomial Fisher-Yates Clopper-Pearson Murphy

Size of table or large small medium medium


graph
Given 1', II, p, find low accuracy essentially impossible
a (binomial usually easy,
tail probability) high accuracy
often hard
Given 1', II, a, find low accuracy easy and low accuracy usually easy,
p (binomIal usually easy, accurate high accuracy impossible
confidence high accuracy
limit) usually hard
Given p, II, a, usually easy, hard, easy
find I' accurate accurate usually accurate
(bmomial
critical value)
Given p, 1', a, usually easy, hard, medium easy
find II difficulty
accurate accurate usually accurate

74. (a) Verify that the coverage of a 0.65 tolerance region with confidence 0.95 based
    on 20 observations has the distribution graphed in Fig. 9.2.
    (b) Show that the same region has coverage c with confidence 1 − α for any
    values c and 1 − α related according to the graph.
75. How could a tolerance region be defined for all sample sizes n ;;:: 10 so that the
relationship between the tolerance proportion and the confidence level is the
same for all n?
76. (a) Show that the region obtained from the cutting functions φ_1(x) = x, φ_2(x) = x,
    φ_3(x) = |x − (X(1) + X(n))/2| is a valid univariate tolerance region.
    (b) Which of the generalizations discussed in Sect. 9.6 are adequate to cover this
    case, and how do they do so?
CHAPTER 3
One-Sample and Paired-Sample
Inferences Based on Signed Ranks

1 Introduction
In Chap. 2 we saw that the sign test is the best possible test (in strong senses
of "best ") at level a for a null hypothesis which is as inclusive as the statement
"the observations are a random sample from a population with median 0
(or ~o)." It certainly seems as though better use could be made of the observa-
tions by taking their magnitudes into account. However, since the sign test
is optimum at level oc for this inclusive set of null distributions, any procedure
which considers magnitudes would be a better test at level a only for some
smaller and more restrictive set of null distributions. Such procedures are of
special relevance if the restricted set is the one of interest anyway, or if the
"restriction" of the null hypothesis can reasonably be assumed as a part of
the model, so that the restricted hypothesis essentially amounts to the
unrestricted one above. Furthermore, their exact levels may vary only
slightly under the kinds of departure from assumptions which are likely,
and their increased power may well be worth the small price. Recall that, in
principle, level and power considerations should always be balanced off.
This chapter will present and discuss tests and confidence procedures
suggested by the more restrictive hypothesis that the observations are a
random sample from a population which is symmetric with median 0 (or
any other value). (Symmetry will be defined precisely below.) Other related
hypotheses will also be considered. All of the inference procedures presented
are based on what are called the" signed ranks" of the observations, and the
primary emphasis is on the well-known Wilcoxon signed-rank test. All of the
discussion here will be in the context of observations in a single random
sample. However, as in the last chapter, these procedures may be performed
on the differences of paired observations, like treatment-control differences


in matched pairs, without any change in techniques or properties as long as


it is remembered that the relevant distribution is the distribution of differences
of pairs.

2 The Symmetry Assumption or Hypothesis


We noted in Chap. 1 that the normal distribution is symmetric about its
mean, and that the binomial distribution with p = 0.5 is symmetric about n/2,
but an explicit definition of symmetry was not given. A random variable X
is said to have a symmetric distribution or to be symmetrically distributed
about the center of symmetry μ if

P(X ≤ μ - x) = P(X ≥ μ + x)   for all x.   (2.1)

This can be written in terms of the c.d.f. F of X as

F(μ - x) = 1 - F(μ + x) + P(X = μ + x)   for all x.   (2.2)

Equivalently, a distribution with discrete frequency function or density f is
symmetric about μ if and only if (Problem 1)

f(μ - x) = f(μ + x)   for all x.   (2.3)

If there is symmetry, there must be a center of symmetry, μ, and this point is
both the mean (if it exists) and a median of the distribution (Problem 3).
It may be noted that (2.1) holds for all x if and only if it holds for all x > 0,
and that the two (nonstrict) inequalities therein may be replaced by two
strict inequalities (Problem 2).
Symmetry can be viewed in several alternative ways. Some common
conditions, each of which is equivalent to the statement that X is symmetri-
cally distributed about μ (Problem 4), are listed below.

(a) X - μ is symmetrically distributed about 0.
(b) X - μ and μ - X are identically distributed.
(c) Given that |X - μ| = x, for any x ≠ 0 the sign of X - μ is equally
likely to be positive or negative.

The null hypothesis of interest in this chapter is that the observations are
a random sample from a population symmetric about a given value μ, that is,
the observations are independently, identically and symmetrically distributed
about μ, where μ is specified. Since μ is then the median, the sign test of
Chap. 2 also applies here, but the additional assumption of symmetry permits
"better" tests. These tests have corresponding confidence procedures for
the center of symmetry.

The procedures developed in this chapter are tests and confidence intervals
for location, where the location parameter is the center of symmetry. In
this sense, they are nonparametric analogs of the classical (normal-theory)
tests and confidence intervals for the mean. Even when the symmetry
assumption is relaxed in certain ways, the procedures of this chapter remain
valid, that is, they retain their level of significance or confidence.
In paired-sample applications with observation pairs (V, W), the symmetry
property must hold for the distribution of differences W - V. The difference
between the individual population medians of V and W is not necessarily
a center of symmetry, or even a median, of the difference W - V (Problem 6).
It is a center of symmetry when V and W are independent and related by
translation, or when their joint distribution is symmetric (permutable,
exchangeable) in the sense that (V, W) has the same distribution as (W, V)
(Problems 7 and 8). This means that in a treatment-control experiment,
for example, randomization of treatment and control within pairs makes a
test of the null hypothesis of 0 center of symmetry valid as a test of the null
hypothesis that the treatment has no effect on any unit. (See Sect. 7, Chap. 2
and Sect. 9, Chap. 8 for more complete discussion.)

3 The Wilcoxon Signed-Rank Test


In this section we present the Wilcoxon signed-rank test for an hypothesized
center of symmetry. The test and its properties will be developed first under
the assumption that the observations X1, ..., Xn are an independent,
random sample from a population (distribution) which is continuous and
symmetric about μ. In Sect. 3.5, we will see that both the assumptions of
independence and of identical distributions can be relaxed in certain ways
without affecting the level, and hence the validity, of the test. The confidence
procedures related to this test will then be discussed in Sect. 4. The continuity
assumption ensures that P(Xi = μ) = 0 for all i and that P(Xi = Xj) = 0
for all i ≠ j, so that the effects of "zeros" and "ties" need not be considered
until later, in Sect. 6.

3.1 Test Procedure and Exact Null Distribution Theory

Suppose the observations Xj are ranked in order of absolute value, using
positive integer ranks with 1 for the smallest |Xj| and n for the largest. (While
it is not necessary, it may be more convenient to rearrange the Xj in order of
absolute value to determine these ranks.) The signed rank of an observation
is the rank of its absolute value with a plus or minus sign attached; this sign

is + if the observation is positive, and - if the observation is negative.


For example, consider the following data.¹

xj             49  -67    8   16    6   23   28   41   14   29   56   24   75   60  -48
rank of |xj|   11   14    2    4    1    5    7    9    3    8   12    6   15   13   10
signed rank    11  -14    2    4    1    5    7    9    3    8   12    6   15   13  -10

The signed-rank sum T is defined as the sum of the signed ranks. It may be
expressed as the positive-rank sum T+ minus the negative-rank sum T-, where
T+ and T- are the sums of the ranks of the positive and negative observa-
tions respectively. (Note that, as defined, T- is positive.) Thus, in this
example, we have
T+ = 11 + 2 + ... + 13 = 96, T- = 14 + 10 = 24,
T = T+ - T - = 72.
Each of these three statistics determines the other two by a linear relation,
so that only the most convenient one need be computed. Specifically,
because T+ + T- equals the sum of all the ranks, or 1 + 2 + ... + n =
n(n + 1)/2, we have
T+ = n(n + 1)/2 - T- = n(n + 1)/4 + T/2 (3.1)
T- = n(n + 1)/2 - T+ = n(n + 1)/4 - T/2 (3.2)
T = T+ - T- = n(n + 1)/2 - 2T- = 2T+ - n(n + 1)/2. (3.3)
Any of these three statistics may be called the Wilcoxon signed-rank statistic,
as the test may be based on any one of them.
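As a computational aside (ours, not part of the original exposition), the ranking and the three statistics for the data above can be obtained with a few lines of Python; the variable names are arbitrary, and the absence of zeros and ties among the |xj| is assumed.

    # Wilcoxon signed-rank statistics for the data of Sect. 3.1.
    x = [49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48]

    # Rank the observations by absolute value (rank 1 for the smallest |x|).
    order = sorted(range(len(x)), key=lambda i: abs(x[i]))
    rank = [0] * len(x)
    for r, i in enumerate(order, start=1):
        rank[i] = r

    t_plus = sum(r for xi, r in zip(x, rank) if xi > 0)   # positive-rank sum T+
    t_minus = sum(r for xi, r in zip(x, rank) if xi < 0)  # negative-rank sum T-
    t = t_plus - t_minus                                  # signed-rank sum T
    print(t_plus, t_minus, t)                             # 96 24 72, as in the text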
Under the null hypothesis that the population center of symmetry μ is
zero, the signs of the signed ranks are independent and equally likely to be
positive or negative (Problem 9). This fact determines the null distribution
of each of the statistics. The null distribution of T+ is the same as that of
T- (Problem 10). The probability of a particular value of T+, say, is

P(T+ = t) = u_n(t)/2^n,   (3.4)

where u_n(t) is the number of ways that plus and minus signs can be attached
to the first n integers 1, 2, ..., n such that the sum of the integers with positive
signs equals t. Equivalently, u_n(t) is the number of subsets of the first n
¹ The xj are differences, in eighths of an inch, between heights of cross- and self-fertilized plants
of the same pair, as given by Fisher [1966 and earlier editions]. The original experiment was done
by Darwin. Fisher's discussion is extensive and interesting, including pertinent quotations from
Darwin and Galton. After applying the ordinary t test to these data, obtaining t = 2.148 and a
one-tailed probability of 0.02485, Fisher introduces the method of Sect. 2.1, Chap. 4 and obtains
a one-tailed probability of 0.02634.

integers whose sum is t. The values of u_n(t) and the probabilities in (3.4)
may be easily generated using recursive techniques (Problem 11).
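One such recursion takes the ranks 1, 2, ..., n one at a time; a hypothetical Python sketch (ours, not the book's, with an arbitrary function name) is given below. It counts, for each t, the subsets of {1, 2, ..., n} summing to t and divides by 2^n.

    def signed_rank_null(n):
        """Return P(T+ = t) for t = 0, ..., n(n+1)/2 under the null hypothesis."""
        max_t = n * (n + 1) // 2
        u = [1] + [0] * max_t              # u_0(t): only the empty subset, with sum 0
        for k in range(1, n + 1):
            for t in range(max_t, k - 1, -1):
                u[t] += u[t - k]           # rank k is either in the positive set or not
        return [count / 2 ** n for count in u]

    # Left-tail probability for the example of Sect. 3.1: n = 15, T- = 24.
    p = signed_rank_null(15)
    print(round(sum(p[:25]), 4))           # about 0.0206, matching the Table D value quoted below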
The possible values of T+ (and of T-) range from 0 to n(n + 1)/2. The
mean and variance (Problem 14) under the null hypothesis are
E(T+) = E(T-) = n(n + 1)/4 (3.5)
var(T+) = var(T-) = n(n + 1)(2n + 1)/24 = (2n + 1)E(T+)/6. (3.6)
From these results and the relation (3.3), we obtain
E(T) =0 (3.7)
var(T) = 4 var(T+) = n(n + 1)(2n + 1)/6. (3.8)
The null distributions of T+, T- and T are all symmetric about their
respective means (Problem 15).
The left-tail cumulative probabilities from (3.4), that is P(T+ ≤ t) =
P(T- ≤ t), are given in Table D for all different integer values of t not
exceeding n(n + 1)/4 (probabilities not exceeding 0.5), for all sample sizes
n ≤ 20. The tables in Harter and Owen [1970] give these probabilities for
all n ≤ 50. We now describe how these tables are used to perform the test.
If the population center of symmetry μ is positive, one would anticipate
more positive signs than negative, and hence more positive signed ranks
among the observations, at the higher ranks as well as the lower. Such an
outcome would give larger values of T, and hence of T+. This suggests a
test rejecting the null hypothesis if T is too large, that is, when T falls at or
above its critical value, which is the upper α-point of its null distribution
for a test at level α. By (3.1)-(3.3), this is equivalent to rejecting if T+ is too
large, and also to rejecting if T - is too small. Since Table D gives only the
lower-tailed cumulative null distribution of T+ or T-, the convenient
rule for rejection based on this table, for the alternative μ > 0, is T- less
than or equal to its critical value. The table entry for an observed value of
T- is the one-tailed (left-tailed) P-value according to the Wilcoxon signed-
rank test.
The corresponding test against the alternative μ < 0 rejects if T is too
small (i.e., highly negative), which is equivalent to T+ too small (or T-
too large). Hence the appropriate P-value here, the probability that T+
is less than or equal to an observed value, is again found from Table D as a
left-tailed probability.
An equal-tailed test against the two-sided alternative μ ≠ 0 rejects at
level 2α if either of the foregoing tests rejects at level α. Since the table gives
left-tailed probabilities only, it should be entered with the smaller of T+
and T- for an equal-tailed test, and the P-value is twice the tabulated value.
In the example above we found T- = 24. According to Table D, under
the null hypothesis the probability that T- ≤ 24 is 0.0206. A one-tailed
Wilcoxon test in this direction would then reject at level 0.025 but not at
0.020; an equal-tailed test would have P-value 0.0412 and would reject at
level 0.05 but not at 0.04.

In order to test the null hypothesis that the center of symmetry is μ = μ0
for any value of μ0 other than 0, the analogous procedure is to subtract
μ0 from every Xj and then proceed to find the signed ranks as before; that is,
the Wilcoxon test is applied to X1 - μ0, ..., Xn - μ0 instead of X1, ..., Xn.
The corresponding confidence bounds on μ will be discussed in Sect. 4.
A different representation of the Wilcoxon signed-rank statistics will be
convenient later, although it is not convenient for hypothesis testing. We
define a Walsh average as the average of any two observations Xi, Xj, that is

(Xi + Xj)/2   for 1 ≤ i ≤ j ≤ n.   (3.9)

Note that each observation is itself a Walsh average where i = j. From a
sample of n then, we obtain n(n + 1)/2 Walsh averages, n(n - 1)/2 where
i < j and n where i = j. The sign of (Xi + Xj)/2 is equal to the sign of either
Xi or Xj, whichever is larger in absolute value. The theorem below follows
easily from this observation (Problem 18).

Theorem 3.1. The positive-rank sum T+ and negative-rank sum T- are
respectively the number of positive and negative Walsh averages (Xi + Xj)/2,
1 ≤ i ≤ j ≤ n.

Equivalent to this result is a method of expressing the Wilcoxon signed-
rank statistics as functions of the indicator variables defined by

Tij = 1 if Xi + Xj > 0,   Tij = 0 otherwise.   (3.10)

Specifically, we can write

T+ = ΣΣ_{i≤j} Tij,   (3.11)

and T- and T can be similarly represented.
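The indicator representation is easy to check numerically; the short Python sketch below (ours, with arbitrary names) counts the positive Walsh averages for the data of Sect. 3.1 and recovers T+ = 96.

    x = [49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48]
    n = len(x)
    # T+ as the number of positive Walsh averages (Xi + Xj)/2, 1 <= i <= j <= n,
    # that is, the sum of the indicators Tij of (3.10) over i <= j.
    t_plus = sum(1 for i in range(n) for j in range(i, n) if x[i] + x[j] > 0)
    print(t_plus)   # 96, agreeing with the rank computation in Sect. 3.1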

3.2 Asymptotic Null Distribution Theory

For n large, T, T+, and T- are approximately normal under the null hypo-
thesis, with the means and variances given in (3.5)-(3.8). (Asymptotic
normality was discussed in Sect. 9, Chap. 1.) The normal approximation may
be used for sample sizes outside the range of Table D. The procedure for
finding approximate normal tail probabilities based on T+ or T- is shown
at the bottom of this table. When using T, the mean and variance in (3.7)
and (3.8) are appropriate. A continuity correction may be incorporated,
in the amount of 1/2 for T+ or T- and 1 for T, since T takes on only alternate
integer values (Problem 19). However, the amount of the continuity correc-
tion is small (Problem 20) and in fact it reduces the accuracy of the approxi-
mation for small tail probabilities. The approximation without continuity
correction is very good for P = 0.025, and the approximation with continuity

correction is very good for P = 0.05. In general, for very small P, the correc-
tion is in the opposite direction from the error in the approximation, and of
much smaller magnitude. For details and an extensive investigation of the
accuracy of critical values based on the normal approximation, see
McCornack [1965J and Claypool and Holbert [1975]. Some simple approxi-
mations involving the t distribution are investigated in Iman [1974].
In the example of Sect. 3.1, with n = 15, the mean and variance of T
under the null hypothesis are 0 and 1240 respectively. The approximate
normal deviate corresponding to P(T ≥ 72) is thus

72/√1240 = 2.045 without correction for continuity,

(72 - 1)/√1240 = 2.016 with correction for continuity.

From Table A we find the corresponding tail probabilities are 0.0204 and
0.0219 respectively. For comparison, the exact tail probability is 0.0206.
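These figures can be reproduced with a few lines of Python (ours, not the book's; the helper uses only the error function from the standard library and applies the correction described above).

    from math import erf, sqrt

    def upper_tail(z):
        """Standard normal upper-tail probability."""
        return 0.5 * (1 - erf(z / sqrt(2)))

    n, t = 15, 72
    var_t = n * (n + 1) * (2 * n + 1) / 6                 # var(T) = 1240 from (3.8)
    print(round(upper_tail(t / sqrt(var_t)), 4))          # 0.0204, no continuity correction
    print(round(upper_tail((t - 1) / sqrt(var_t)), 4))    # 0.0219, with correction of 1 for T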
The accuracy of the normal approximation can be improved by using
an Edgeworth expansion. Taking terms to order l/n, the Edgeworth approxi-
mation gives the same degree of accuracy with n = 15 as the normal approxi-
mation with n = 100 [Fellingham and Stoker, 1964].
*Formally, we may say that under the null hypothesis, T, T+, and T-
are asymptotically normal (see Sect. 9, Chap. 1) with the mean and variance
given by (3.5)-(3.8). This follows immediately for T+ and T- once it is
proved for T (Problem 21). To prove it for T, note that T has the same dis-
tribution as Σ_{j=1}^n Rj, where R1, R2, ... are independent with P(Rj = j) =
P(Rj = -j) = 1/2 under the null hypothesis. It is easily verified (Problem 22)
that the Liapounov criterion for the Central Limit Theorem (Theorem 9.3
of Chap. 1) is satisfied.*

3.3 Large Sample Power

For n large, the distributions of T+, T- and T are approximately normal


under alternative hypotheses as well as in the null case. Hence, once the
appropriate means and variances under alternative distributions are cal-
culated, a large-sample approximation to the power of the Wilcoxon signed-
rank test can be computed. Asymptotic normality will not be proved here,
but the mean and variance of T+ are found in the theorem below.

Theorem 3.2. Let X1, ..., Xn be a random sample from some continuous
population. Then

E(T+) = np1 + n(n - 1)p2/2,   (3.12)

where p1 is the probability that a single X is positive and p2 is the probability
that the sum of two independent X's is positive, that is,

p1 = P(Xj > 0)   for all j,   (3.13)
p2 = P(Xi + Xj > 0)   for all i ≠ j.   (3.14)

The variance is

var(T+) = np1(1 - p1) + n(n - 1)p2(1 - p2)/2 + 2n(n - 1)(p3 - p1 p2)
          + n(n - 1)(n - 2)(p4 - p2^2),   (3.15)

where

p3 = P(Xi > 0 and Xi + Xj > 0)   for all i ≠ j,   (3.16)
p4 = P(Xi + Xj > 0 and Xi + Xk > 0)   for all distinct i, j, k.   (3.17)

PROOF. By using the representation of T+ in (3.11), its mean and variance
are easily expressed in terms of moments of the Tij defined in (3.10) as

E(T+) = nE(Tii) + n(n - 1)E(Tij)/2,   (3.18)

var(T+) = ΣΣ_{i≤j} ΣΣ_{h≤k} cov(Tij, Thk)
        = n var(Tii) + [n(n - 1)/2] var(Tij) + 2n(n - 1) cov(Tii, Tik)
          + n(n - 1)(n - 2) cov(Tij, Tik)
          + [n(n - 1)(n - 2)(n - 3)/4] cov(Tij, Thk),   (3.19)

where h, i, j and k are all different.

Using the probabilities defined in (3.13), (3.14), (3.16) and (3.17), and the
fact that Xi, Xj are independent for all i ≠ j, gives the moments of the Tij
(Problem 24) as

E(Tii) = p1   for all i
E(Tij) = p2   for all i ≠ j
var(Tii) = p1 - p1^2 = p1(1 - p1)   for all i
var(Tij) = p2 - p2^2 = p2(1 - p2)   for all i ≠ j
cov(Tii, Tik) = p3 - p1 p2   for all i ≠ k
cov(Tij, Tik) = p4 - p2^2   for distinct i, j, k
cov(Tij, Thk) = 0   for distinct i, j, h, k.

The results in (3.12) and (3.15) follow once these moments are substituted in
(3.18) and (3.19).   □

The moments in Theorem 3.2 hold for a random sample from any con-
tinuous population. If the population is symmetric about zero, as under the
null hypothesis that the population center of symmetry is zero, the prob-
abilities defined in Theorem 3.2 have the values (Problem 25)

p1 = 1/2,   p2 = 1/2,   p3 = 3/8,   p4 = 1/3.
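As an illustration (ours, not the book's), these values, and the way (3.12) then collapses to the null mean n(n + 1)/4, can be checked approximately by simulating p1 through p4 for any continuous distribution symmetric about zero, here a standard normal; the seed and number of replications are arbitrary.

    import random

    random.seed(2)
    reps = 200_000
    c1 = c2 = c3 = c4 = 0
    for _ in range(reps):
        xi, xj, xk = (random.gauss(0, 1) for _ in range(3))
        c1 += xi > 0
        c2 += xi + xj > 0
        c3 += (xi > 0) and (xi + xj > 0)
        c4 += (xi + xj > 0) and (xi + xk > 0)
    p1, p2, p3, p4 = (c / reps for c in (c1, c2, c3, c4))
    print(round(p1, 2), round(p2, 2), round(p3, 2), round(p4, 2))   # about 0.50 0.50 0.38 0.33

    n = 15
    print(n * p1 + n * (n - 1) * p2 / 2)   # close to n(n + 1)/4 = 60, as (3.12) requires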

3.4 Consistency

A test is called consistent against a particular alternative if the power of the


test against the alternative approaches 1 as the sample size approaches
infinity. Of course, the test must be defined for each sample size. Thus
consistency is, strictly speaking, a property of a sequence of tests, one test
for each sample size. (In some contexts it might be desirable to let the alter-
native also depend on the sample size.)
The Wilcoxon signed-rank test is consistent against any alternative
distribution, even nonsymmetric, for which the probability is not 1/2 that the
sum of two independent observations is positive, that is, for which

p2 = P(Xi + Xj > 0) ≠ 1/2   for i ≠ j.

(Note that the value of the parameter p1 does not determine consistency
here, while that parameter alone determines consistency and even power
for the sign test (Sect. 3.2, Chap. 2).) More precisely, consider a two-tailed
Wilcoxon signed-rank test with level bounded away from 0 as n → ∞, and a
fixed continuous alternative distribution F for which p2 ≠ 1/2. We show
below that if X1, ..., Xn are independently distributed according to F for
each n, then the probability of rejection approaches 1 as n → ∞. Similarly,
the one-tailed test is consistent against alternatives with p2 on the appropriate
side of 1/2.

*In order to prove this, we first show a fact of some interest in itself,
namely that if X1, ..., Xn are independently, identically distributed (accord-
ing to any distribution whatever), then 2T+/n^2 is a consistent estimator of
p2, or equivalently, converges in probability to p2. By the definition given in
Eq. (3.5) of Chap. 1, this means that, for every ε > 0,

P(|2T+/n^2 - p2| > ε) → 0   as n → ∞.   (3.20)

In this specified sense, the distribution of 2T+/n^2 becomes concentrated
near p2 as n → ∞. Note that the distribution of 2T+/n^2 depends on n through
the numerator as well as the denominator, since the distribution of T+
itself depends on n (although the notation does not explicitly express this
fact). Note also that a more natural estimator of p2 is ΣΣ_{i<j} Tij/[n(n - 1)/2],
which is unbiased for p2 (Problem 39). (See also Sect. 5.)

PROOF. Letting Zn = 2T+/n^2, by (3.12) and (3.15) we have

E(Zn) → p2   and   var(Zn) → 0   as n → ∞.

A simple application of Chebyshev's inequality then shows (Problem 26)
that (3.20) holds, that is, Zn converges in probability to p2, so that Zn is a
consistent estimator of p2.
It remains to show that the equal-tailed Wilcoxon signed-rank test at
level α is consistent against alternatives for which p2 ≠ 1/2. Under the null
hypothesis, we have p2 = 1/2; therefore, by (3.20), for any ε > 0, we have for
sufficiently large n

P0(|Zn - 1/2| > ε) < α/2.

This implies that the upper and lower (α/2)-points of the null distribution
of Zn lie between 1/2 - ε and 1/2 + ε. Since Zn is just a multiple of T+, it follows
that the test rejects the null hypothesis at least for |Zn - 1/2| > ε, provided n
is sufficiently large.
Consider a particular alternative with p2 ≠ 1/2, and let ε = |p2 - 1/2|/2.
Then |Zn - 1/2| > ε whenever |Zn - p2| ≤ ε, so that for sufficiently large n
the test rejects at least for |Zn - p2| ≤ ε. But the probability of this event
approaches 1 under the alternative, by (3.20), which proves that the test is
consistent.   □
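A small simulation (ours; the normal distribution and the shift 0.3 are arbitrary choices) illustrates the fact just used, namely that Zn = 2T+/n^2 concentrates near p2 as n grows.

    import random

    def average_z(n, shift=0.3, reps=200):
        """Average of Zn = 2 T+ / n^2 over reps samples of size n from a shifted normal."""
        total = 0.0
        for _ in range(reps):
            x = [random.gauss(shift, 1.0) for _ in range(n)]
            t_plus = sum(1 for i in range(n) for j in range(i, n) if x[i] + x[j] > 0)
            total += 2 * t_plus / n ** 2
        return total / reps

    random.seed(1)
    for n in (10, 40, 160):
        print(n, round(average_z(n), 3))
    # The printed values approach p2 = P(Xi + Xj > 0), about 0.66 for this shift, as n increases.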
The type of proof used here works quite generally to show that consistent
estimators lead to consistent tests. Consider a parameter θ, an estimator
which is consistent for θ, and a test based on that estimator as a test statistic.
Suppose that (a) θ has one value θ0 under the null hypothesis and a different
value under the alternative, (b) the test rejects in the appropriate tail of the
distribution of the estimator, (c) the probability of rejection under the null
hypothesis is bounded away from 0, and (d) either the distribution of the
estimator under the null hypothesis is completely determined, or the esti-
mator is uniformly consistent under the null hypothesis. (θn is a uniformly
consistent estimator of θ0 under H0 if θn converges uniformly in probability
to θ0 under H0, that is, for every ε > 0, P0(|θn - θ0| > ε) → 0 as n → ∞,
uniformly in the distributions of H0.)

The first step in the proof is to observe that the test must, for sufficiently
large n, reject whenever the estimator falls more than ε away from the value
to which it converges in probability under the null hypothesis (more than ε
away in one direction if the test is one-tailed). The second step is to consider an
alternative under which the estimator converges in probability to a different
value, and to observe that, under such an alternative, the probability ap-
proaches 1 that the estimator will lie within ε of this different value, and hence,
if ε is small enough, more than ε away from the null hypothesis value, whence
rejection occurs.
Note that it may be necessary to redefine the test statistic so that it
becomes a consistent estimator of something. For instance, we considered

above Zn = 2T+/n^2 instead of T+. Tails of the distribution of T+ correspond
to tails of the distribution of Zn, but Zn converges in probability to p2,
while the distribution of T+ moves further and further to the right with
variance increasing as n → ∞.
A similar approach to a general proof could be based on the test statistic
standardized under the null hypothesis (Problem 29). Under the alternative,
the variance of this statistic is bounded but the mean approaches ∞ as
n → ∞.*

3.5 Weakening the Assumptions

We have been assuming that X1, ..., Xn are independent and identically
distributed, with a continuous and symmetric common distribution. The
Wilcoxon test procedures were developed using only the fact that the signs
of the signed ranks are independent and are equally likely to be positive or
negative. The level of the test will therefore be unaffected if, in particular, the
Xj are independent with continuous distributions symmetric about μ0,
even if their distributions are not the same (Problem 30). (The level of the
corresponding confidence procedure to be discussed in Sect. 4 is preserved
if the Xj are independent with continuous distributions possibly different,
but all symmetric about the same μ.)
Even independence is not required for the validity of the Wilcoxon test,
provided that the conditional distribution of each X j given the others is
symmetric about μ (Problem 31). It is valid, for example, under the null
hypothesis that the treatment actually has no effect when the X j are the
treatment-control differences in a matched-pairs experiment, as long as
the controls are chosen independently and at random, one from each pair
(Problem 32).
If the continuity assumption is relaxed, then there is positive probability
of a zero or tie, and the test has not yet been defined. This situation is discussed
in some detail in Sect. 6.
For the one-tailed test, even the assumption of symmetry can be relaxed.
Specifically, suppose the Xj are independent and continuously distributed,
not necessarily identically. Consider the one-tailed test at level α for the
null hypothesis of symmetry about the origin with rejection region in the
lower tail of the negative-rank sum. It can be shown (Problem 35) that this
test rejects with probability at most α if

P(Xj < -x) ≥ P(Xj > x)   for all x ≥ 0 and all j   (3.21)

and with probability at least α if

P(Xj < -x) ≤ P(Xj > x)   for all x ≥ 0 and all j.   (3.22)
Under (3.21), the probability in any left tail of the distribution of Xj exceeds
or equals the probability in the corresponding right tail, so that the dis-
tribution of Xj is "to the left of or equal to" ("stochastically smaller" than)

a distribution symmetric about 0 (Problem 33). Equation (3.22) means the


opposite: the distribution of Xj is "stochastically larger" than a distribu-
tion symmetric about zero. Since the probability of rejection is at most α
when (3.21) holds, the null hypothesis could be broadened to include (3.21)
without affecting the significance level. Similarly, it is natural to broaden the
alternative hypothesis to include (3.22); since the probability of rejection
is at least α when (3.22) holds, the test is by definition "unbiased" against
(3.22). This is the least we would hope for in broadening the alternative.
(Of course, any particular distribution whatever could be included in the
alternative hypothesis without affecting the validity of the test in the sense
that the significance level, which depends only on the null hypothesis,
would be preserved. However, we would not want to add an "alternative"
distribution under which the probability of rejection is less than α.)
The statement above, that the probability of rejection is at most α under
(3.21) and at least α under (3.22), follows (Problem 35) from a fact which is of
interest in itself. Suppose a (nonrandomized) test φ is "increasing" in the
sense that, if φ rejects at X1, ..., Xn and any Xj is increased, φ still rejects.
(The upper-tailed Wilcoxon test obviously has this property.) Then the
probability that φ will reject increases (not necessarily strictly) when the
distribution of Xj is moved to the right ("stochastically increased"), that is,
when the c.d.f. Fj of Xj is replaced by Gj where Gj(x) ≤ Fj(x) for all x. (In
such a case, we say that Gj is stochastically larger than Fj.) Formally, for
randomized tests c/>, we have the following theorem.

Theorem 3.3. If X1, ..., Xn are independent, Xj has c.d.f. Fj, and φ(X1, ..., Xn)
is a randomized test function which is increasing in each Xj, then the probability
of rejection

α(F1, ..., Fn) = E[φ(X1, ..., Xn)]   (3.23)

is a decreasing function of the Fj, that is,

α(G1, ..., Gn) ≥ α(F1, ..., Fn)   if Gj(x) ≤ Fj(x) for all x and all j.   (3.24)

The same holds if the words increasing and decreasing are interchanged
and the first inequality of (3.24) is reversed. (The theorem does not depend
on the fact that φ is a test, but it will be applied only to tests.)

PROOF. Suppose that X1, ..., Xn are independent and Xj has c.d.f. Fj. Then
there exist independent random variables Y1, ..., Yn such that Yj has c.d.f.
Gj and P(Yj ≥ Xj) = 1 for each j (Problem 34). It follows that, for φ in-
creasing in each argument,

α(G1, ..., Gn) = E[φ(Y1, ..., Yn)]
              ≥ E[φ(X1, ..., Xn)] = α(F1, ..., Fn).

For φ decreasing, the inequality is reversed.   □



4 Confidence Procedures Based on the Wilcoxon


Signed-Rank Test
Suppose that we have a random sample X1, ..., Xn from a continuous,
symmetric distribution and we want to find the confidence bounds for the
center of symmetry μ which correspond to the Wilcoxon signed-rank test.
The confidence region consists of those values of μ which would be "accepted"
if the usual Wilcoxon test for zero center of symmetry were applied to the
set X1 - μ, ..., Xn - μ. We could proceed by trial and error, testing various
values of μ to see which ones lead to "acceptance." However, the interpreta-
tion of the Wilcoxon test statistic as a function of Walsh averages (Xi + Xj)/2
provides a more direct and convenient method of determining the cor-
responding confidence bounds.
In Theorem 3.1, we found that the test statistics T+ and T- are the number
of positive and negative Walsh averages respectively. Suppose that μ is
subtracted from every Xi. Since

(Xi - μ + Xj - μ)/2 < 0   if and only if   (Xi + Xj)/2 < μ,

the number of negative Walsh averages among the (Xj - μ) equals the
number of Walsh averages less than μ for the original Xj. Therefore, the
Wilcoxon test with rejection region T- ≤ k, when applied to the observa-
tions (Xj - μ), would reject or accept according as μ is smaller than or
larger than the (k + 1)th smallest Walsh average (Xi + Xj)/2, counting
from smallest to largest in order of algebraic (not absolute) value. That is, the
(k + 1)th Walsh average from the smallest is a lower confidence bound for μ
with a one-sided α equal to the null probability that T- ≤ k. Similarly, the
(k + 1)th from the largest Walsh average is an upper confidence bound at
the same level. Hence, the endpoints of any confidence region based on the
Wilcoxon signed-rank procedure are order statistics of the Walsh averages.
All of the Walsh averages can be easily generated and ordered (sorted) by
computer. However, there is also a convenient graphical procedure for
finding the confidence bounds as Walsh averages. Plot each Xj value on a
horizontal line, the "X axis," as in Fig. 4.1 for the data -1, 2, 3, 4, 5, 6, 9, 13
with n = 8. On one side of the X axis (either above or below), draw two rays
emanating from each Xj, as in the diagram. All rays should make equal
angles with the X axis. Then the Walsh averages are exactly the horizontal
coordinates of the intersections of the rays. The points on the X axis must be
included, as they correspond to the n Walsh averages where i = j, that is, the
original observations. The (k + l)th smallest Walsh average can be identified
by counting from the left in the diagram. Its value may be read from the
graph or, if greater accuracy is desired, calculated as the corresponding
(Xi + X j )/2. In the latter case, it may be necessary to calculate more than
one Walsh average to determine exactly which is the (k + 1)th. Continuing
the example, for n = 8 we find from Table D that P(T- ≤ 3) = 0.0195,

Figure 4.1 Adapted from Gibbons, Jean D. [1971], Nonparametric Statistical Inference,
New York: McGraw-Hill, p. 118, Fig. 3.1. With permission of the publisher and author.

so that k = 3 corresponds to a one-tailed test at exact level 0.0195. The 4th


smallest Walsh average is therefore a lower confidence bound for μ with
confidence coefficient 0.9805, and this number can be read off Fig. 4.1 as
1.5, or identified and calculated as (-1 + 4)/2 = 1.5. Similarly, an upper
confidence bound with the same confidence coefficient is 9.0, and therefore
the two-sided confidence interval, 1.5 < μ < 9.0, has confidence coefficient
1 - 2(0.0195) = 0.961.
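The same bounds can be obtained by sorting all the Walsh averages directly, as in this short Python sketch (ours, not part of the original text).

    x = [-1, 2, 3, 4, 5, 6, 9, 13]
    n = len(x)
    walsh = sorted((x[i] + x[j]) / 2 for i in range(n) for j in range(i, n))

    k = 3                        # P(T- <= 3) = 0.0195 from Table D for n = 8
    lower = walsh[k]             # (k + 1)th smallest Walsh average
    upper = walsh[-(k + 1)]      # (k + 1)th largest Walsh average
    print(lower, upper)          # 1.5 and 9.0; two-sided confidence coefficient 0.961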
For k ≤ 2, the situation becomes especially simple. The critical values
k = 0, 1, 2 give one-sided levels α = 2^(-n), 2(2^(-n)), 3(2^(-n)) respectively for any
n > k. The corresponding confidence bounds are (Problem 36) X(1), (X(1) +
X(2))/2, and min{X(2), (X(1) + X(3))/2} respectively, where X(i) is the ith
order statistic, that is the ith from the smallest of the Xj in order of algebraic
(not absolute) value. For k ≥ 3, this sort of description rapidly becomes
more complicated (Problem 37).

5 A Modified Wilcoxon Procedure


In computing the rank sums, suppose that the ranks 1, 2, ..., n are replaced
by the numbers 0, 1, ..., n - 1, which we will call modified ranks. Denote the
new rank sums by T0, T0+, T0-. As an example, for the data in Sect. 3.1, the
following results are obtained.

xj                      49  -67    8   16    6   23   28   41   14   29   56   24   75   60  -48
modified rank of |xj|   10   13    1    3    0    4    6    8    2    7   11    5   14   12    9
modified signed rank    10  -13    1    3    0    4    6    8    2    7   11    5   14   12   -9

T0+ = 83,   T0- = 22,   T0 = 61,   n = 15.
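A sketch of this computation in Python (ours; it simply repeats the calculation of Sect. 3.1 with every rank reduced by one) is given below.

    x = [49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48]
    order = sorted(range(len(x)), key=lambda i: abs(x[i]))
    mod_rank = [0] * len(x)
    for r, i in enumerate(order):          # modified ranks 0, 1, ..., n - 1
        mod_rank[i] = r

    t0_plus = sum(r for xi, r in zip(x, mod_rank) if xi > 0)
    t0_minus = sum(r for xi, r in zip(x, mod_rank) if xi < 0)
    print(t0_plus, t0_minus, t0_plus - t0_minus)   # 83 22 61, as above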

Under the null hypothesis of symmetry about 0, with either the original
or weakened assumptions, T0- has the same distribution in a sample of size n
as does T- in a sample of size n - 1 (Problem 38). Accordingly, Table D
applies to T0-, and similarly to T0+, provided that (n - 1) is used in place of n.
In the example above, then, the one-tailed P-value for the lower tail of T0- is
0.0290 for the modified test, since this is the probability from Table D that
T- ≤ 22 for 14 observations.
With n - 1 in place of n, the formulas in (3.5) and (3.6) for the mean and
variance under the null hypothesis apply to T0+, as does the asymptotic
normality discussed in Sect. 3.2.
T0+ is the number of positive Walsh averages excluding those with i = j,
that is, the number of positive (Xi + Xj)/2 for i < j (Problem 39). Therefore,
the graphical method of obtaining confidence bounds works for the modified
procedure with slight changes. Specifically, the points on the X axis must now
be excluded, and of course the critical value k is now that for T0- rather than
T-.
Under alternative distributions, we have (Problem 40)

E(T0+) = n(n - 1)p2/2   (5.1)
var(T0+) = n(n - 1)(n - 2)(p4 - p2^2) + n(n - 1)p2(1 - p2)/2   (5.2)

where p2 and p4 are defined by (3.14) and (3.17) respectively. Although we
will not prove it here, T0+ is approximately normal with this mean and vari-
ance. Thus we can compute the approximate power of a test based on T0+.
Since the dominant terms of (5.1) and (5.2) are the same as those of the
corresponding moments of T+ given in (3.12) and (3.15), it follows that
T0+ and T+ are asymptotically equivalent, in a sense which is not trivial to
make precise (see Chap. 8). In particular, the consistency results for T+ in
Sect. 3.4 apply to T0+.
A natural estimator of p2, the probability of a positive Walsh average,
is 2T0+/n(n - 1). When the Xj are a random sample from some continuous
population, this estimator is unbiased, by (5.1), and consistent, as just
remarked. If the class of possible distributions is broad enough (all con-
tinuous distributions, for instance), then no other unbiased estimator has
variance as small (Problem 41). On the basis of asymptotic normality, an
approximate confidence interval for p2 could also be obtained from

|T0+ - n(n - 1)p2/2| ≤ z √var(T0+)   (5.3)

where z is an appropriate standard normal deviate. From (5.2) we see that
var(T0+) is a function of p4 as well as p2; hence p4 must be eliminated before
(5.3) will yield confidence limits on p2.
We could estimate p4 from the data as the proportion of triples i, j, k, all
distinct, for which Xi + Xj and Xi + Xk are both positive; alternatively,
p4 could be replaced by the upper bound p2 (Problem 42). In either case, the
right-hand side of (5.3) still depends on p2. This inequality could then be
solved for p2 (Problem 43). Another possibility is to estimate var(T0+) as

given by (5.2), using one of the foregoing methods for p4 and the estimate
2T0+/n(n - 1) for p2, and substitute this estimate of var(T0+) on the right-
hand side of (5.3) (Problem 44).
The statistic T0+ is thus especially natural and useful for estimation of p2.
There are two further reasons for introducing the modified Wilcoxon
procedure. The first is that it illustrates the inevitable arbitrariness involved
in the choice of a test or confidence procedure. Neither T nor T0 has any
definitive theoretical or practical advantage over the other, and they are
equally easy to use. In the example of Sect. 3.1 with n = 15 we get a one-tailed
P-value of 0.0206 from T and 0.0290 from T0, and lower and upper confidence
bounds of 10 and 39 each at the one-tailed level 0.0535 from T, 7 and 39 each
at the one-tailed level 0.0520 from T0 (Problem 45). There is no reason to
say that one set of values is any better than the other, and both methods are
reasonable. A single "correct" or "best" method of statistical analysis does
not exist in anything like the present framework.
The other reason for introducing the modified procedure is that the
choice of exact levels is thereby increased. It may happen that T provides a
nonrandomized procedure at nearly the level desired but T0 does not, or
vice versa. For instance, suppose the nominal level selected is 0.05 for n = 10.
From Table D, we see that T0 provides a test at the one-sided level 0.0488
while the one-sided levels for T which are nearest 0.05 are 0.0420 and 0.0527.
If there is some real reason for insisting on one particular level and a ran-
domized procedure is undesirable, the fact that T and T0 provide tests at
different natural levels may be grounds for choice between them in a parti-
cular case. Of course, with either T or T0, the approximate method of
interpolating confidence limits between those at attainable levels may be
used. (This was explained generally in Sect. 5, Chap. 2.) The accuracy of
such interpolation has not been investigated extensively, but results given in
Problem 49 and quoted in Chap. 2 for interpolation midway between any
two successive order statistics, and results given in Problem 50 for inter-
polation between the two smallest (or largest) Walsh averages, suggest that,
especially when discreteness matters most, the accuracy is often quite poor
but interpolation tends to be conservative.

6 Zeros and Ties


6.1 Introduction

An observation which equals 0 is called a "zero;" its sign has not been
defined. Two or more observations which have the same absolute value
(zero or not) are called a "tie;" their ranks (in order of absolute value) have
not been defined. As a result, if zeros or ties occur, the sum of the signed
ranks cannot be determined without some further specification. We have

avoided this difficulty so far by assuming continuous distributions; this


ensures that the probability of a zero or a tie is equal to zero. In practice, it is
necessary to deal with zeros and ties because measurements are not always
sufficiently precise to avoid them, even with theoretically continuous distri-
butions, and discontinuous distributions do occur. (The modified Wilcoxon
procedure of Sect. 5 assigns rank 0 to the smallest observation in absolute
value, whether it is a zero or not. Therefore, a single zero creates no problem
in this procedure. Otherwise, the discussion below applies to it also, with the
necessary changes.)
Of course, the Walsh averages are well defined even with zeros or ties,
although some of them may now be tied. The confidence procedures given
earlier in Sect. 4 depend only on the values of certain order statistics of the
set of Walsh averages, and these values are well defined for any data (and
easily found by the graphical procedure or some other technique). It can be
shown that, if k is defined as before and found from Table D, the (k + 1)th
smallest average, say L, is still a lower confidence bound for the population
center of symmetry μ in the present situation. However, the exact statement
for the confidence property is now more delicate. The lower confidence
bound L has at least the indicated probability of falling at or below μ, and at
most the indicated probability of falling strictly below μ. That is

P(L ≤ μ) ≥ 1 - α ≥ P(L < μ),   (6.1)

where 1 - α is the exact confidence level in the continuous case (Problem 51;
see also Problems 35 and 109). A corresponding statement holds for the
upper confidence bound. Accordingly, a two-sided confidence interval
including its endpoints has at least the indicated probability of covering μ,
while the interval excluding its endpoints has at most the indicated probability
of covering μ. Thus the confidence procedures (one- and two-sided) of Sect. 4
are still applicable. The only question is whether or not to include the end-
points in the interval; ordinarily this does not matter and the question need
not be resolved.
Since the confidence procedures are still applicable, a test of hypothesis
about μ at level α can be performed by finding the corresponding confidence
bound(s) at level 1 - α. If μ0 is not an endpoint of this region, the test
"accepts" or rejects according as μ0 is inside or outside the region. If μ0 is
an endpoint, it may be sufficient to state this fact and not actually carry the
test procedure further (or to be "conservative" and "accept" the null
hypothesis. As we shall see, this is equivalent to the "conservative" procedure
of rejecting μ0 if and only if resolving all ambiguities in the definition of
signed ranks in favor of "acceptance" would still lead to rejection.) It may
be desirable to proceed further, however, in order to report an accurate
P-value or to obtain more nearly the desired significance level, and to avoid
the loss of power resulting from "conservatism," which is serious in situations
where many zeros or ties are likely. The following subsections will discuss
various methods of handling ties and zeros. For some of these methods, the

null distribution of the Wilcoxon test statistic as given in Table D can still be
applied, while others produce a different null distribution so that ordinary
tables cannot be used. Furthermore, some other surprising traps and
anomalies can arise. See Pratt [1959] for a more complete discussion than
is given here.

6.2 Obtaining the Signed Ranks

There are several different methods of obtaining signed ranks for observa-
tions which include zeros and/or ties. We will now describe each of these
methods briefly. In the next three subsections, we illustrate the resulting
test procedures and discuss certain properties of them.
Ties may occur in only nonzero observations, in only zero observations,
or in both simultaneously. Zeros may occur with or without ties. We will
consider each relevant case starting with nonzero ties.
The three basic methods of handling nonzero ties which we shall discuss
are called (a) the average rank method, (b) breaking the ties randomly, and
(c) breaking the ties "conservatively."
The average rank (or midrank) method assigns to each member of a group
of tied observations the simple average of the ranks they would have if they
were not tied. In general, for a set of tied observations larger than the (k - 1)th
smallest and smaller than the (l + 1)th smallest observations, that is, tied values
in rank positions k through l, the average rank is (k + l)/2, which is always
an integer or half integer, and this rank is given to each of the observations
in this tied set. This approach gives a unique set of ranks, and tied observa-
tions are given tied ranks. Since the Wilcoxon statistic uses the signed ranks
for the absolute values of the observations, the possible ranks are averaged
for sets of tied absolute values of the observations and then signs are attached
to these resulting average ranks. (Examples will be given shortly.)
Methods which break the ties give each observation, including the tied
values, a distinct integer rank, that is, as if there were no ties. Two methods
of doing so are the "random" method and the "conservative" method. In
the present context, with tied absolute values at ranks k through l, if m of
these belong to negative observations, the random method would attach
the m minus signs to a sample of size m drawn at random without replace-
ment from the integers k through l. The "conservative" method would
attach the m minus signs to the smallest m of these integers when testing for
rejection in the direction of smaller values of μ, the largest m in the other
direction, thus breaking all ties in favor of "acceptance." Other methods of
breaking ties will not be considered here.
For zeros, which may occur singly or as ties, there are analogs of each of
the foregoing methods. Still another method is to discard the zeros from the
sample before ranking the remaining observations. This last method will be
called the reduced sample procedure. For the signed-rank test, Wilcoxon

[1945] recommended this practice, with the ordinary test then being applied
to the reduced sample if there are no nonzero ties. However, this leads to a
surprising anomaly, which will be discussed later. Further, one might argue
purely intuitively that the ambiguity about the signed ranks of the zero
observations is irrelevant to the ranks of the nonzero observations, and hence
these ranks should not be changed by discarding the zeros. Nevertheless,
if the zeros are retained, they must also be given signed ranks. One possi-
bility is to give each zero a signed rank of zero; we call this the signed-rank
zero procedure. The signed-rank zero is actually an average rank, since it is
the average of the two signed ranks which could be assigned to any zero,
regardless of what method is used to obtain the unsigned rank for the zero.
Tiebreaking procedures, on the other hand, would assign the signed ranks
±1, ±2, etc., to the zeros, choosing the signs either randomly (independently
with equal probability) or "conservatively" (all + signs when testing for
rejection in the direction of smaller values of μ, all - signs in the other
direction).
If both nonzero ties and one or more zeros are present, it would be
possible to use any one of the procedures for zeros in conjunction with
any one of the procedures for nonzero ties. With any procedure used for
nonzero ties, however, it is natural to use either the corresponding procedure
for zeros or the reduced sample procedure. When the "conservative"
procedure is viewed as inadequate, we recommend that the signed-rank
zero method be used in conjunction with the average rank procedure, for
reasons indicated later in Sects. 6.4 and 6.5.

6.3 Test Procedures

We illustrate the basic methods of handling ties, with zeros also present,
for the following data (arranged in order of absolute value):
0, 0, -1, -2, 2, 2, 3. (6.2)
(a) The assignment of ranks to all observations by the average rank
method, in combination with the signed-rank zero procedure, is illustrated in
Table 6.1 for the data in (6.2). By the average rank and signed-rank zero
methods, the values of the Wilcoxon signed-rank statistics are T+ = 17,
T- = 8, T = 9. Note that the relationships between T+, T- and T, as given
in (3.1)-(3.3), must be modified; if v zeros are present, then v(v + 1) must be
subtracted from n(n + 1) throughout (Problem 52d).

Table 6.1
Xj 0 0 -1 -2 2 2 3
Possible ranks 1,2 1,2 3 4,5,6 4,5,6 4,5,6 7
Average rank 1.5 1.5 3 5 5 5 7
Signed rank 0 0 -3 -5 5 5 7
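A hypothetical Python sketch (ours) of the average rank and signed-rank zero computation for these data follows; it reproduces the signed ranks of Table 6.1 and the statistics T+ = 17, T- = 8 quoted above.

    x = [0, 0, -1, -2, 2, 2, 3]
    abs_sorted = sorted(abs(v) for v in x)
    # Average (mid-) rank of |v|: mean of the positions it would occupy if untied.
    avg_rank = [sum(r for r, a in enumerate(abs_sorted, 1) if a == abs(v))
                / abs_sorted.count(abs(v)) for v in x]
    # Signed-rank zero procedure: zeros receive signed rank 0.
    signed = [r if v > 0 else -r if v < 0 else 0 for v, r in zip(x, avg_rank)]

    t_plus = sum(r for r in signed if r > 0)
    t_minus = -sum(r for r in signed if r < 0)
    print(signed, t_plus, t_minus)   # signed ranks 0, 0, -3, -5, 5, 5, 7; T+ = 17, T- = 8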

The null distributions of these test statistics are not as given in Table D,
as is obvious since the ranks with signs are not the first n integers. An exact
test can, however, be performed conditionally on the number of zeros
present and the ranks assigned to the nonzero observations. Under the null
hypothesis of symmetry about 0, given the pattern of zeros and ties (and even
the absolute values) present, the conditional distribution of the positive and
negative rank sums by the average rank method is determined by the fact
that all assignments of signs to the average ranks of the nonzero observations
are equally likely.
For the data in (6.2), given the absolute values present, including two
zeros, three observations tied at ranks 4-6, and two untied observations at
ranks 3 and 7, there are 2^5 equally likely assignments of + and - signs to the
relevant average ranks, which are 3, 5, 5, 5, and 7. We enumerate these
assignments and calculate T- as follows.

Number of
Negative Ranks r- Cases

none 0
3 3
5 5 3
7 7 1
3, 5 8 3
3,7 10 1
5,5 10 3
5, 7 12 3
3,5,5 13 3
3,5,7 15 3
5,5,5 15 1
5,5,7 17 3
3,5,5,5 18 1
3,5,5,7 20 3
5,5,5,7 22 1
3,5,5,5,7 25

(T+ and T- are equivalent conditional test statistics, since T+ = 25 - T-
given two zeros.) The conditional null distribution of T- is then as follows.

t               0   3   5   7   8  10  12  13  15  17  18  20  22  25
2^5 P(T- = t)   1   1   3   1   3   4   3   3   4   3   1   3   1   1
2^5 P(T- ≤ t)   1   2   5   6   9  13  16  19  23  26  27  30  31  32

In the sample observed, we had T- = 8. The probability of a value at least
as small as that observed, that is, the P-value, is

P(T- ≤ 8) = 9/32 = 0.28.

The probability of a value smaller than that observed, the next P-value, is

P(T- < 8) = P(T- ≤ 7) = 6/32 = 0.19.
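Such conditional distributions are easy to enumerate by computer; the Python sketch below (ours) runs over all 2^5 equally likely sign assignments to the nonzero average ranks 3, 5, 5, 5, 7 and recovers the two P-values just given.

    from itertools import product

    ranks = [3, 5, 5, 5, 7]
    t_minus_values = [sum(r for r, s in zip(ranks, signs) if s < 0)
                      for signs in product((1, -1), repeat=len(ranks))]

    total = len(t_minus_values)                          # 32 equally likely cases
    print(sum(t <= 8 for t in t_minus_values) / total)   # P(T- <= 8) = 9/32 = 0.28125
    print(sum(t <= 7 for t in t_minus_values) / total)   # P(T- <= 7) = 6/32 = 0.1875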
When there are relatively many ties, as here, the normal approximation
may be very inaccurate. The relevant mean and variance are easily obtained
and the distribution is symmetric (Problem 52), but the uneven spacing of
the possible values of T- makes correction for continuity difficult and of
doubtful value. Further, general lumpiness of the distribution removes any
possibility of high accuracy, even when a continuity correction is used.
However, when enumeration of the null distribution is too laborious, one
could use the Monte Carlo method (simulation) described further in Sect. 2.5,
Chap. 4. A table of critical values for the case when there are zeros but no
nonzero ties is given in Rahe [1974].
(b) If a tiebreaking procedure is applied to the data in (6.2) without
omitting the zeros, there are 12 different possible signed-rank assignments
for the possible ranks shown in Table 6.1, obtained as follows. The two
zeros, with possible ranks 1 and 2, could be given signed ranks either 1 and 2,
or -1 and 2, or 1 and - 2, or -1 and - 2. The three observations with
absolute value 2 must each be assigned one of the ranks 4, 5, and 6. When
signs are attached, one of these ranks must be negative since one of the 2's
was negative. Thus the signed rank associated with - 2 is either - 4, - 5, or
- 6, and the two observations which are + 2 have the remaining two ranks
with positive sign. The observation -1 has signed rank - 3 and the observa-
tion 3 has signed rank 7 in any case.
As a result of these possibilities, breaking the ties in this set of observations
could lead to anyone of 4(3) = 12 sets of signed ranks. In all 12 cases, the
negative-rank sums are between 7 and 12 inclusive. The two methods of
breaking the ties which produce (i) the smallest, and (ii) the largest, negative-
rank sum are shown in Table 6.2. There are in addition two cases with
T- = 8, three with T- = 9, three with T- = 10, and two with T- = 11,
as shown in Table 6.3 (Problem 53).

Table 6.2
xj                  0   0  -1  -2   2   2   3   T-
Signed ranks (i)    1   2  -3  -4   5   6   7    7
Signed ranks (ii)  -1  -2  -3  -6   5   4   7   12

Table 6.3 Results of Breaking the Ties

T-                            7       8       9      10      11      12
Number of cases               1       2       3       3       2       1
Wilcoxon tail probability  0.1484  0.1875  0.2344  0.2891  0.3438  0.4063
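Table 6.3 can be verified by direct enumeration of the 12 tiebreaking possibilities, as in this Python sketch (ours): the two zeros may carry negative signs on ranks 1 and/or 2, the observation -1 always carries rank 3, and the observation -2 takes rank 4, 5, or 6.

    from collections import Counter

    counts = Counter()
    for zero_part in (0, 1, 2, 3):         # total rank carried by negatively signed zeros
        for rank_of_minus_2 in (4, 5, 6):  # rank assigned to the observation -2
            counts[zero_part + 3 + rank_of_minus_2] += 1
    print(sorted(counts.items()))
    # [(7, 1), (8, 2), (9, 3), (10, 3), (11, 2), (12, 1)], matching Table 6.3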

The random method of breaking ties selects one of the possible resolutions
of ties by using some supplementary random experiment which ensures that
each possible set of signed-rank assignments is equally likely to be chosen.
Because this preserves the usual null distribution of the Wilcoxon signed-
rank statistic (Problem 54), the usual critical value can then be used, or the
P-value can be found from Table D. Thus one of the columns of Table 6.3
would be selected, with probability proportional to the number of cases.
Instead of actually breaking the ties at random and using standard tables,
we might report the probability for which doing so would lead to rejection,
as in reporting φ(x) for a randomized test (Sect. 5.2, Chap. 1). For example,
for the data of (6.2), if we were using the critical value t = 8, three of the
twelve possible ways of breaking ties would lead to rejection. If the ties
were broken at random, therefore, there would be a 3/12 = 0.25 chance of
rejection. Instead of actually randomizing, one could report this probability.
(For P-values one could in principle report similarly that randomizing
would give P = 0.1484 with probability 1/12, P = 0.1875 with probability 2/12,
etc., or some summary of this information.)
The "conservative" method of breaking ties, when rejecting for small
values of T-, would assign negative signs to the two zeros of the data (6.2),
and assign rank 6 rather than 4 or 5 to the negative observation -2, so as
to maximize T-, as in the bottom line of Table 6.2. The "conservative"
value of T- would therefore be T- = 12 and the "conservative" P-value
from Table 6.3 is 0.4063. This means that all methods of breaking ties would
give T- ≤ 12 and would reject at any critical value of 12 or more and hence
at any level α ≥ 0.4063.
The other side of the conservative coin is that no method of breaking
ties would give a value T- < 7. Hence all methods of breaking ties lead to
"acceptance" for any critical value of 6 or less. For critical values between 7
and 11 inclusive, however, the conclusion is indeterminate, and the ordinary
Wilcoxon test would reject by one resolution of ties but not by another.
Approaching the hypothesis testing problem from a confidence interval
viewpoint sheds some light on the interpretation of results when ties are
broken in all possible ways. The Wilcoxon signed-rank test for the null
hypothesis μ = 0 should presumably "accept" μ = 0 when 0 is inside and
reject when 0 is outside the usual confidence interval defined by the order
statistics of the Walsh averages. By Problem 55, the ordinary signed-rank
test leads to rejection no matter how the ties are broken if and only if 0 is
outside the confidence interval and not an endpoint of it. The same holds for
"acceptance" and" inside." The remaining possibility, that 0 is an endpoint
of the confidence region, occurs when and only when the ordinary Wilcoxon
test would reject by one method of breaking the ties but not by another.
Hence, instead of resolving the indeterminacy as to whether the test "accepts"
or rejects μ = 0, one might simply state that 0 is an endpoint of the corre-
sponding confidence interval, as suggested earlier. For the data in (6.2),
we have seen that indeterminacy occurs for the critical values 7-11, and it is

easily verified that the corresponding confidence bounds, the Walsh averages
at ranks 8-12, are all 0 (Problem 56).
Note that these comments apply only to tiebreaking procedures, meaning
that the ordinary Wilcoxon test is used after the ties are broken. They
unfortunately do not apply to the average rank procedure, which, as we
shall see, may give a conclusion opposite to that based on a tie breaking
procedure even when the latter is determinate. Though inconvenient, this
is not a telling objection to the average rank procedure, since there is nothing
sacrosanct about the Wilcoxon procedure itself, even in the absence of ties.
(c) A reduced sample procedure would omit the two zeros in the data of
(6.2), leaving a sample of size 5 with a three-way nonzero tie. The tie could
be handled by any of the methods described above. We shall not illustrate
this procedure here. However, note that it can disagree with tiebreaking in
the complete sample, and has a still more objectionable property which will
be described shortly.

6.4 Warnings and Anomalies: Examples

Nonzero ties

For a given pattern of zeros and ties (strictly, for given absolute values), if
the average rank procedure would be used in some cases, it is to be used in all
cases, even those where tiebreaking is unambiguous. Furthermore, it may
lead to the opposite conclusion. Thus, it is not a valid shortcut to use tie-
breaking when it is unambiguous and average ranks when tiebreaking is
indeterminate. Similar comments would presumably apply to other pro-
cedures for handling zeros and ties, in the absence of proof to the contrary.
To illustrate the difficulty, consider the following sample, in which the
tied observations all have the same sign:
1, 1, 1, 1, 2, 3, -4, 5. (6.3)
Any method of breaking the ties gives the same signed ranks, namely
1, 2, 3, 4, 5, 6, -7, 8. (6.4)
Thus, one is tempted to apply the Wilcoxon test to these signed ranks without
further ado. The null probability of a negative-rank sum of less than 7 is
14/2^8 while that of 7 or less is 19/2^8 (0.0547 and 0.0742 respectively, from
Table D). Hence, when the ties are broken in any way, (6.3) would be judged
not significant at any one-sided level α ≤ 0.0547 and significant at any level
α ≥ 0.0742. (For 0.0547 < α < 0.0742, the exact level α is unavailable, but
such values of α are not required for the present discussion.)
Now if the average rank procedure is used on the sample in (6.3), the
signed ranks are
2.5, 2.5, 2.5, 2.5, 5, 6, -7, 8, (6.5)
and the test statistic still has the value T- = 7. The left tail of the null
distribution of T- given the ties at ranks 1,2,3, and 4 is shown below (Prob-
lem 57).
value of T-:                   0    2.5    5    6    7    7.5    8    8.5    9.5
frequency (out of 2^8 = 256):  1     4     7    1    1     8     1     4      4
Hence, by the average-rank procedure, P(T- ≤ 7) = 14/2^8 = 0.0547, and
the sample in (6.3) would be judged significant at the level 0.0547, even
though it is not significant at this level when the ties are broken, no matter
how they are broken. For α = 0.0547, the two methods reach opposite
conclusions. (This is true for all α in some interval including 0.0547, but
this interval depends on what is done when the exact level α is not available.
Furthermore, similar disagreement in the other direction is also possible
(Problem 58).)
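As an illustrative check (ours, not from the text; the variable names are ours), the short enumeration below, assuming Python, reproduces this average-rank null probability by listing all 2^8 equally likely sign assignments for the midranks of sample (6.3).

```python
from itertools import product

# Midranks of the absolute values in sample (6.3): the four observations tied
# at ranks 1-4 each receive the average rank 2.5; the rest keep ranks 5-8.
midranks = [2.5, 2.5, 2.5, 2.5, 5, 6, 7, 8]

# Under the null hypothesis every assignment of signs to the midranks is
# equally likely; T^- is the sum of the negatively signed midranks.
counts = {}
for signs in product([+1, -1], repeat=len(midranks)):
    t_minus = sum(r for s, r in zip(signs, midranks) if s < 0)
    counts[t_minus] = counts.get(t_minus, 0) + 1

p = sum(c for t, c in counts.items() if t <= 7) / 2 ** len(midranks)
print(round(p, 4))   # 0.0547, i.e. 14/256, the value quoted above
```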
In terms of P-values, the exact P-value is 0.0547 by the average-rank
procedure, while it is 0.0742 by the Wilcoxon test no matter how the ties are
broken. In terms of confidence bounds, for α = 0.0547, the lower confidence
bound is 1 by the average-rank procedure but -0.5 by the usual procedure.
Thus the two procedures give bounds with opposite signs (and 0 is not an
endpoint of either confidence interval).
There is no contradiction here, but there is a warning; namely, it is not
valid to use a tiebreaking procedure when tiebreaking is unambiguous,
not even when all tied observations have the same sign, if one would have
used another procedure (such as average ranks) in other cases (with the
same absolute values).
Consider, for example, a sample with the same absolute values as (6.3),
but different observed signs, as in (6.6):
-1, -1, -1, 1, 2, 3, 4, 5. (6.6)
Using the average rank procedure, we find T- = 7.5. The null distribution
of T- for the average rank procedure is the same as that given above for
sample (6.3), so that P(T- ≤ 7.5) = 22/2^8 = 0.0859. But this computation
of the null distribution of T- by the average rank procedure for sample (6.6)
assumes that the signed ranks for sample (6.3) are those of (6.5), not those of
(6.4) which result from breaking the ties. Thus, if we would use the average
rank procedure for the sample (6.6), we must also use it for the sample (6.3),
even though tiebreaking would be easier and unambiguous for (6.3). The
alternative would be to use tiebreaking in both samples (and others with the
same absolute values). When the ties in (6.6) are broken, the possible values
of T- are 6 ≤ T- ≤ 9, with corresponding P-values from Table D of
0.0547 and 0.1250 at the extremes. If this degree of ambiguity is too great,
one might prefer the average rank procedure, but one must make this
decision in advance, without knowing the signs and hence without knowing
whether the actual sample is (6.3), or (6.6), or some other sample with the
same absolute values.

Zeros
In the case of zeros, if there are no nonzero ties, it can be shown that the
signed-rank zero procedure gives the same conclusion as tiebreaking
whenever the latter is unambiguous (Problem 59). The reduced sample
procedure, however, may exhibit strange behavior in this and other respects,
as can be illustrated by applying it to the 13 observations
0, 2, 3, 4, 6, 7, 8, 9, 11, 14, 15, 17, -18. (6.7)
Dropping the zero before ranking leaves 12 observations with a negative-
rank sum of 12, which is not significant at any one-sided level α ≤ 55/2^12 =
0.0134 and is significant at any α ≥ 70/2^12 = 0.0171, these being the null
probabilities of less than 12 and 12 or less, respectively. On the other hand,
tiebreaking in favor of the null hypothesis, assigning 0 the signed rank -1,
would result in 13 observations with negative ranks 1 and 13 and a negative-
rank sum of 14, which is significant at α = 109/2^13 = 0.0133, the null
probability of 14 or less. Thus, for 0.0133 ≤ α ≤ 0.0134, the reduced sample
procedure disagrees with tiebreaking even though tiebreaking is unam-
biguous (and, as before, disagreement occurs for a wider range of α which,
however, depends on what is done when the exact level α is not available).
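These tail probabilities are easy to verify by direct enumeration; the following sketch (ours, assuming Python; the function name is ours) checks the three values just quoted.

```python
from itertools import product

def left_tail(ranks, t):
    """Null P(T^- <= t): each rank independently gets sign +1 or -1 with probability 1/2."""
    ranks = list(ranks)
    count = sum(1 for signs in product([+1, -1], repeat=len(ranks))
                if sum(r for s, r in zip(signs, ranks) if s < 0) <= t)
    return count / 2 ** len(ranks)

# Reduced sample: drop the zero, leaving 12 observations with negative-rank sum 12.
print(left_tail(range(1, 13), 11))   # P(T^- < 12)  = 55/2^12  = 0.0134...
print(left_tail(range(1, 13), 12))   # P(T^- <= 12) = 70/2^12  = 0.0171...

# Zero retained with signed rank -1: 13 observations with negative-rank sum 14.
print(left_tail(range(1, 14), 14))   # P(T^- <= 14) = 109/2^13 = 0.0133...
```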
The P-value is 0.0171 by the reduced sample procedure, but it cannot
exceed 0.0133 if the zero is retained, no matter what sign is given to the
zero. These results are comparable to those above for the average rank
procedure for nonzero ties. When we examine confidence regions, however,
an anomaly appears which is more striking and disturbing than before.
For α = 0.0133, the usual lower confidence bound is 1. What is the
confidence region by the reduced sample procedure? It contains μ = 0,
since the reduced sample procedure would accept this hypothesis at this
level. If an amount μ between 0 and 1 is subtracted from every observation,
there will be no zero or tie, so the usual procedure will be used, and will
reject the value μ. This is already strange: the sample is not significant
as it stands but becomes significant in the positive direction if every observa-
tion is reduced by the same small amount. Correspondingly, the confidence
region is not an interval; it contains the point μ = 0, excludes all other values
μ < 1, and contains all μ > 1. Thus it is an interval plus an exterior point.
(Strictly speaking, for those integer and half-integer values of μ where
nonzero ties occur, the procedure has not been defined, but the statement
holds for the average rank procedure and for any tiebreaking procedure.)
It is also possible for the reduced sample procedure to judge a sample
significant in the positive direction, yet not significant when every observa-
tion is increased by the same small amount, corresponding to a confidence
region which is an interval with an interior point removed (Problem 61).
Thus the reduced sample procedure is not only inconsistent with tie-
breaking, but also inconsistent with itself, in the sense that shifting the
sample in one way may shift the conclusion the opposite way. The signed-
rank zero and average rank procedures are self-consistent in this sense.

6.5 Comparison of Procedures

In this subsection we will conclude our discussion of various methods of
handling zeros and ties. We shall list some requirements which seem in-
tuitively desirable to avoid anomalies, and then check the various methods
against them. Of course the most important consideration in general is
power, but this is not the main focus of the present discussion. All the methods
are variants of the Wilcoxon procedure, and if one were seriously trying to
improve upon the power of the Wilcoxon test in some respect, one would
probably be led to a different procedure altogether. Ease of application is
also a consideration, but this depends heavily on individual circumstances
and hence will not be focused on either.
We have in mind alternatives involving primarily changes in location
rather than shape. Accordingly, the following requirements seem intuitively
desirable.
(i) A significantly positive sample shall not become insignificant nor an
insignificant sample significantly negative when (a) some observations are
increased, or (b) all observations are increased by equal amounts. (Require-
ment (b) is weaker than (a). It is included because it seems even more de-
sirable than (a) and because some procedures will satisfy (b) and not (a).)
(ii) Those values of the center of symmetry μ which would be "accepted"
if tested shall form an interval. (This says that the corresponding confidence
region shall be an interval, and is equivalent to (i)(b) (Problem 62).)
(iii) A sample shall be judged significantly positive if it is significantly
positive however the ties are broken; similarly for significantly negative and
not significant. (This is implied by (i) (a) but not by (i)(b).)
Consider first the methods of handling zeros. The data in (6.7) show
that the reduced sample procedure satisfies none of the conditions above,
no matter how nonzero ties are handled if they are present. These factors
are the primary justification for our recommendation against this procedure
in Sect. 6.2, even though the ordinary Wilcoxon tables can be used if there
happen to be no nonzero ties, once n is reduced appropriately. They also
supplant considerations of power for the kinds of situation we have in mind,
although Conover [1973] exemplifies situations where the reduced sample
procedure is both more and less powerful than the signed-rank zero pro-
cedure.
The signed-rank zero procedure, in the absence of nonzero ties, satisfies
all of the above conditions (Problem 59).
The average rank procedure (in conjunction with the signed-rank zero
procedure) satisfies (i)(b) and (ii) but not (i)(a) and (iii) (Problem 60). This
procedure presumably gives better power, at least in any ordinary situation,
than breaking ties either "conservatively" or randomly. The regular tables
do not apply, and the null distribution must be generated for the set of
average ranks observed.
The "conservative" procedure, that is, breaking the ties and choosing
the value of the statistic least favorable to rejection, satisfies all of these
conditions. However, the true level α is unknown and may be much less than
the nominal level, and considerable loss of power may result, especially if
many ties are likely.
Breaking the ties at random, whether by actually doing so and using
standard tables or by reporting the probability with which doing so would
lead to rejection, satisfies all of the above conditions and also the following
version of (i) (Problem 63), which is stronger for randomized test procedures.
(i') The probability that a sample is judged significantly positive shall not
decrease, nor the probability that it is judged significantly negative increase,
when (a) some observations are increased, or (b) all observations are in-
creased by equal amounts.
Breaking the ties at random permits use of the regular tables but then the
analysis depends on an irrelevant randomization. Imposing this extraneous
randomness in an artificial way is unpleasant in itself, and presumably
greater power could be achieved without it [see also Putter, 1955]. The
unpleasantness can be somewhat mitigated, but not entirely eliminated, by
reporting instead the probability with which breaking the ties would lead
to rejection for the sample at hand, rather than actually breaking the ties
in one particular, randomly chosen way. This, however, requires additional
calculation.
However reasonable the properties above may seem in general, in par-
ticular cases larger observations may not be greater evidence of positivity
in the population (Problem 64). Even the normal-theory t statistic does not
satisfy (i)(a) and may decrease when some observations are increased, since
this affects the sample variance as well as the mean (Problem 65). Neverthe-
less, in the absence of information about the underlying distribution, as in
the present nonparametric context, the conditions appear desirable in-
tuitively.
Because the average rank procedure does not satisfy (i)(a) and (iii), it is
all the more tempting to resort to it only when tiebreaking is indeterminate,
i.e., only for μ0 at the end of the usual confidence interval. Unfortunately
there seems to be no easy way to do this and preserve the level α. Accordingly
our recommendation is to use the "conservative" procedure if it is not too
conservative in view of the type of inference desired and the extent of zeros and
ties expected. (If one is testing a null hypothesis, and not forming a con-
fidence interval, one may look at the absolute values present before deciding,
but not at the signs.) Otherwise, we recommend the average rank procedure
in conjunction with the signed-rank zero procedure.

7 Other Signed-Rank Procedures


Now we return to the situation where the random sample, X1, ..., Xn, is
drawn from a continuous population (so that ties need not be considered).
For tests of the null hypothesis that the population is symmetric about 0,
the Wilcoxon signed-rank procedure uses the ranks 1, 2, ..., n in place of
the absolute values of the observations. In Sect. 7.1 we consider procedures
which employ some other set of constants c1, c2, ..., cn in place of the
absolute values. The sum of the signed constants is the test statistic.
In Sect. 7.2, we will see that these and all other (permutation-invariant)
signed-rank tests depend, like the Wilcoxon test, only on the signs of the
Walsh averages. The Walsh averages therefore again determine the boundary
points of the corresponding confidence regions for the center of symmetry.
These regions are discussed in Sect. 7.3. Some particular tests and confidence
procedures involving Walsh averages are presented in Sect. 7.4.

7.1 Sums of Signed Constants

Suppose the observations Xj are replaced in order of the ranks of their
absolute values by the constants c1, c2, ..., cn, and that the sign of the
corresponding X is attached to each constant. The result is called a signed
constant, corresponding to the term signed rank defined in Sect. 3.1. To be
more specific, suppose that Xj has rank k in order of absolute magnitude;
then its signed constant is +ck if Xj is positive and -ck if Xj is negative.
For the data in Sect. 3.1, this gives the following results.

Xj:                 49  -67    8   16    6   23   28   41   14   29   56   24   75   60  -48
k = rank of |Xj|:   11   14    2    4    1    5    7    9    3    8   12    6   15   13   10
signed ck:         c11 -c14   c2   c4   c1   c5   c7   c9   c3   c8  c12   c6  c15  c13 -c10

Analogously to the Wilcoxon signed-rank statistic T, we define a statistic
which is the sum of these signed constants. Tests based on the sum of the
signed ck's are equivalent to tests based on the sum of the negative ck's,
or the sum of the positive ck's (Problem 79).
Under the null hypothesis that the population is symmetric about 0,
all assignments of signs are equally likely. This fact determines the null
distribution of the test statistic. Therefore, tables could be generated for any
particular set of constants Ck, although not as easily as in Problem 11 for
the Wilcoxon case unless the Ck for different n satisfy some special relation-
ships.
If the hypothesized center of symmetry is some value other than 0, say
μ0, the foregoing test can be applied to the Xj - μ0. The set of values of
μ0 which would be accepted by this procedure forms a confidence region for
an assumed center of symmetry μ. The confidence bounds are Walsh aver-
ages, but for arbitrary ck they are determined differently from the Wilcoxon
procedure, as we shall see in Sect. 7.3.
For Ck = k, the sum of signed-constants procedure is identical to the
Wilcoxon signed-rank procedure. For Ck = k - 1, it is the modified Wilcoxon
procedure of Sect. 5. For ck = 1, we have the sign test of Chap. 2, and the
corresponding confidence bounds, which are order statistics. Other possi-
bilities for Ck arise naturally (in Sect. 9, for instance). However, even if
appropriate tables are available, these other tests (and especially the cor-
responding confidence procedures) are at least somewhat more difficult to
use than those just mentioned. In the absence of tables, the normal approxi-
mation could be used (Problem 80), but it has only limited, though perhaps
adequate, accuracy in small samples. In large samples, where the normal
approximation is more accurate, confidence limits may be preferable to
tests but especially difficult to obtain. With more effort, the null distribution
for any particular set of constants Ck can be obtained by enumeration, or
approximated by Monte Carlo methods (simulation). These approximations
will be discussed in the next chapter, in connection with "observation-
randomization" procedures (where tabulation is impossible). Here, we need
only note that the distribution can be determined or at least approximated
well. In some problems, the advantages of these procedures may warrant
the extra effort in analysis.
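To illustrate the Monte Carlo possibility just mentioned (an illustration of ours, not the book's; the function and data names are ours), the sketch below, assuming Python, approximates the upper-tailed null P-value of a sum of signed constants for an arbitrary set of constants, using the data of this subsection.

```python
import random

def signed_constant_pvalue(x, c, reps=100000, seed=0):
    """Approximate upper-tailed null P-value of the sum-of-signed-constants
    statistic by simulating random sign assignments.
    x: nonzero observations; c: constants c_1, ..., c_n (c[k] goes with rank k+1)."""
    n = len(x)
    # Attach +c_k or -c_k according to the sign of the observation whose
    # absolute value has rank k.
    order = sorted(range(n), key=lambda i: abs(x[i]))
    observed = sum((1 if x[i] > 0 else -1) * c[k] for k, i in enumerate(order))
    rng = random.Random(seed)
    hits = sum(1 for _ in range(reps)
               if sum(rng.choice((-1, 1)) * ck for ck in c) >= observed)
    return hits / reps

# With the Wilcoxon constants c_k = k this approximates the signed-rank P-value.
data = [49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48]
print(signed_constant_pvalue(data, c=list(range(1, 16))))
```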
For any signed-rank test and corresponding confidence procedure, the
assumptions made in the introduction to this section can be relaxed as in
the first two paragraphs of Sect. 3.5. In some circumstances, the continuity
assumption can also be relaxed so that ties have positive probability; in
particular, tests based on sums of signed constants and corresponding
confidence bounds are conservative for discrete distributions if the ck
satisfy 0 ≤ c1 ≤ c2 ≤ ... ≤ cn (Problem 109). For one-tailed signed-rank
tests, the assumption of symmetry can also be relaxed as in Sect. 3.5, provided
the test satisfies the condition of Theorem 3.3. This condition is satisfied,
in particular, by any one-tailed test based on a sum of signed constants ck
such that 0 ≤ c1 ≤ ... ≤ cn (Problem 83).

7.2 Signed Ranks and Walsh Averages

The tests based on sums of signed constants ck depend only on the signed
ranks of the Xj; if Xj has signed rank -k, the signed constant is -ck, and
if Xj has signed rank +k, the signed constant is +ck. Tests depending only
on the signed ranks, including those in this general class and many more,
are called signed-rank tests.
We have already seen that the Wilcoxon signed-rank test depends only
on the signs of the Walsh averages (because the positive-rank sum is the
number of positive Walsh averages, by Theorem 3.1), and that the corre-
sponding confidence limits are order statistics of the Walsh averages. We
will see in this subsection that all signed-rank tests are in practice equivalent
to tests depending only on the signs of the Walsh averages, and in the next
subsection that the corresponding confidence limits are always Walsh
averages.
The exact statement of the relationship for tests is given in the following
theorem.

Theorem 7.1. The signed ranks determine the signs of the Walsh averages,
so any test depending only on the signs of the Walsh averages is a signed-rank
test. Conversely, the signs of the Walsh averages determine the signed ranks
except possibly for the order in which they occur. Therefore, any signed-rank
test which does not take into account the order of the signed ranks depends
only on the signs of the Walsh averages.

The proof of this theorem is similar to that of Theorem 3.1 (Problem 81).
To illustrate the point about o!u~r, wnsider a sample XI, X 2, X 3 whose
Walsh averages have the signs given below.

i   j   sign of (Xi + Xj)/2

1   1   +
1   2   -
1   3   +
2   2   -
2   3   -
3   3   +

Clearly, X1 and X3 are positive, and X2 is negative with larger absolute
value than either X1 or X3. However, there is no way to determine whether
the respective signed ranks of X1, X2, X3 are 2, -3, 1 or 1, -3, 2. Thus the
signs of the Walsh averages determine the signed ranks collectively, but
they do not completely determine their order.
Of course, intuitively, there is no reason to care about order anyway in
the situations of concern here, or to think that order is relevant. The tests
we have been considering do not take the order of the Xj into account,
that is, they are invariant under permutations of the Xj (see Sect. 8.1). For a
permutation-invariant test, Theorem 7.1 says that it is a signed-rank test
if and only if it depends only on the signs of the Walsh averages.

7.3 Confidence Bounds Corresponding to Signed-Rank Tests

Given a sample X1, ..., Xn from a continuous, symmetric distribution,
we now want to find the confidence region for the center of symmetry μ
which corresponds to some (particular) signed-rank test. The region is,
of course, the set of values of μ which would be accepted if the test were
applied to the values Xj - μ. As μ varies, the signed ranks of the Xj - μ
will change only when μ = (Xi + Xj)/2 for some i, j, and hence the outcome
of a signed-rank test for μ will change only at these values of μ (Problem 84.
For permutation-invariant tests this is also a consequence of Theorem 7.1.)
It follows that the boundary points of the confidence region corresponding
to any signed-rank test are Walsh averages of the original sample. (The
region will be an interval provided the test satisfies condition (i)(b) of Sect.
6.5. This holds for a test based on a sum of signed ck's provided c_{k+1} ≥
c_k ≥ 0 for all k (Problem 85).)
As a result, the confidence limits corresponding to signed-rank tests are
always Walsh averages. However, except in the Wilcoxon case, the confidence
limit at a given level does not always have the same rank among the Walsh
averages, and the ordered Walsh averages have different confidence levels in
different samples. For certain tests, such as the sign test and others mentioned
in the next subsection, the relevant Walsh average can be easily identified.
In general it cannot, but the following trial and error procedure seems likely
to identify it fairly quickly in most cases.
Consider the Walsh averages arranged in order of algebraic size. Let
T(μ) be the value of the test statistic for the hypothesized value μ. For a
signed-rank test, T(μ) is constant for μ in each interval between adjacent
Walsh averages. Suppose, as is usual, that T(μ) is a monotone function of μ,
so that its values in successive intervals are increasing (or decreasing)
throughout -∞ < μ < ∞. Start the search with some Walsh average,
such as the Wilcoxon confidence bound or a Walsh average close to the
normal-theory bound or to some other approximate bound appropriate
to the test being used. Find the value of the test statistic T(μ) for μ just below
the starting point (i.e., between it and the next smaller Walsh average).
Move to the next smaller or greater Walsh average depending on whether
T(μ) is smaller than or greater than the critical value of the test. Continue
(in the same direction) until the value of T(μ) equals the critical value, or
until successive values bracket it. The Walsh average which separates
rejectable from "acceptable" values of T(μ) is the confidence bound sought.
It may be helpful to rank beforehand all Walsh averages in what seems to be
the relevant range. At each step, at most two signed ranks will change,
and it may be easier to calculate the change in T(μ) than to recalculate
T(μ) from scratch (Problem 84; see also Bauer [1972]).
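The search can be sketched in code (ours, not the book's; the names, the tolerance, and the example critical value are illustrative). The sketch assumes Python, a one-tailed test that rejects when the statistic is too large, and constants 0 ≤ c1 ≤ ... ≤ cn, so that T(μ) is nonincreasing in μ; for simplicity it scans the ordered Walsh averages from the bottom instead of starting from an approximate bound.

```python
def statistic(x, c, mu):
    """Sum-of-signed-constants statistic T(mu): rank |x_j - mu| and attach
    the sign of x_j - mu to the corresponding constant."""
    d = [xj - mu for xj in x]
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    return sum((1 if d[i] > 0 else -1) * c[k] for k, i in enumerate(order))

def lower_confidence_bound(x, c, crit):
    """Lower confidence bound for the test that rejects when T(mu) > crit:
    the Walsh average separating rejected from 'accepted' values of mu."""
    n = len(x)
    walsh = sorted((x[i] + x[j]) / 2 for i in range(n) for j in range(i, n))
    eps = 1e-9                     # evaluate T just below / just above a Walsh average
    if statistic(x, c, walsh[0] - eps) <= crit:
        return float("-inf")       # the test never rejects; no finite bound
    for w in walsh:
        if statistic(x, c, w + eps) <= crit:
            return w               # first Walsh average above which we "accept"
    return None                    # crit below the minimum of T: every mu is rejected

# With Wilcoxon constants c_k = k and crit = 89 (rejection T > 89 is the same as
# T^- <= 15 for n = 15), this returns the usual Wilcoxon lower confidence bound,
# here the 16th smallest Walsh average.
data = [49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48]
print(lower_confidence_bound(data, c=list(range(1, 16)), crit=89))
```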

7.4 Procedures Involving a Small Number of Walsh Averages

If, as for the Wilcoxon confidence intervals, a procedure involves ranking all
the Walsh averages, at least implicitly, then it will automatically be per-
mutation invariant (independent of the order of the observations), and it is
immaterial whether the Walsh averages are defined in terms of the original
observations Xj or the sample order statistics X(j). Confidence procedures
for the center of symmetry which are simpler and still permutation invariant,
can, however, be obtained from the sample order statistics by using a small
number of Walsh averages of the form (X(i) + X(j))/2. We cannot choose a
completely arbitrary function of the Walsh averages; for validity under
the assumption that the population is symmetric, the corresponding test
should be a signed-rank test and hence should depend only on the signs
of the Walsh averages (Theorem 7.1).
The simplest case would be to use a single Walsh average (X(i) + X(j))/2.
For i = j, the confidence limit is simply an order statistic and the procedure
corresponds to a sign test, as discussed in Chap. 2. For i < j, the confidence
limit corresponds to a sign test on the n + i - j observations which are
largest in absolute value (Problem 86; [Noether, 1973]).
With two Walsh averages there are more possibilities, and a more difficult
probability problem must be solved to obtain the confidence level, but for
sample sizes n ≤ 15, procedures have been worked out by Walsh [1949a,b].
Specifically, his tests are of the form: reject the null hypothesis μ = 0 (or
μ ≤ 0) in favor of the alternative μ > 0 if both (X(i) + X(j))/2 and (X(k) +
X(l))/2 are positive, and "accept" otherwise, where the four indices i, j, k,
and l are chosen, not necessarily all different, to give the desired level. The
corresponding lower confidence bound is
min{(X(i) + X(j))/2, (X(k) + X(l))/2}.   (7.1)
A lower-tailed test and corresponding upper confidence bound can be
obtained similarly, and combining one-sided procedures gives two-sided
procedures. For 4 ≤ n ≤ 15, Walsh [1949a,b] gives a table of tests of this
form with one-tailed levels near the conventional values 0.005, 0.01, 0.025,
0.05, and two-tailed levels twice these values. He does not define a particular
method of choosing i, j, k, and l for n > 15.
The Wilcoxon procedures with the critical value 0, 1, or 2 are of this type.
Specifically, for any n ≥ 3 (Problem 36),

T- = 0 if and only if X(1) > 0   (7.2)
T- ≤ 1 if and only if (X(1) + X(2))/2 > 0   (7.3)
T- ≤ 2 if and only if min[X(2), (X(1) + X(3))/2] > 0.   (7.4)

As a result, fourteen of the procedures in Walsh's table are equivalent to
Wilcoxon procedures, while the remaining thirty are not. The modified
Wilcoxon procedure of Sect. 5 satisfies similar relations (Problem 47),
but happens to be equivalent to a procedure in Walsh's table only when it is
also equivalent to an ordinary Wilcoxon procedure (T0 = 0 if and only if
T- ≤ 1).
The levels available using these Walsh procedures are discrete and do
not include exactly the conventional levels. However, Walsh [1949a] gives
modified procedures which have a conventional level under normality and
bounded level under the original symmetry assumption. The modification
is made as follows. In place of X(i) in (7.1), substitute [aX(h) + (1 - a)X(i)]
for some h < i, 0 < a < 1. This substitution gives a lower confidence
bound of
min{[aX(h) + (1 - a)X(i) + X(j)]/2, (X(k) + X(l))/2}.   (7.5)
Since X(h) ≤ X(i), we have always

X(h) ≤ aX(h) + (1 - a)X(i) ≤ X(i).   (7.6)

Hence the confidence level of (7.5) is between the confidence level of (7.1)
and the confidence level that (7.1) would have if X(i) were replaced by X(h).
The exact level depends on the form of the distribution sampled. By appro-
priately choosing the value of a, the exact level under normal distributions
can be adjusted to the desired value. The calculation involved will not be
discussed here. When n = 5, for instance, one procedure of this type has
level α = 0.05 when the Xj are independently, identically, normally distri-
buted, and has level between 0.031 and 0.062 when the observations are
independently, continuously distributed and symmetric about a common
center of symmetry μ.

8 Invariance and Signed-Rank Procedures


The concept of permutation invariance came up briefly in Sect. 7.2 where we
noted that a permutation-invariant test is a signed-rank test if and only if
it depends only on the signs of the Walsh averages. In this section we define
this concept in more detail and for more general one-sample procedures and
discuss the justifications for restricting consideration to permutation-
invariant procedures. Procedures which are invariant under other classes
of transformations are also frequently desirable. Accordingly, we will go on
to show that a strictly increasing, odd transformation on a set of observations
does not change their signed ranks, and conversely that the only procedures
which are invariant under all such transformations are signed-rank pro-
cedures. This provides a possible justification for the use of signed-rank tests
when this type of invariance is considered important (but not for the cor-
responding confidence procedures, as will be explained).

8.1 Permutation Invariance

A procedure φ(X1, ..., Xn) is called permutation invariant in X1, ..., Xn
if it is unchanged by permutations of the Xj. In other words, φ is not changed
if the order of the X's is changed, so that

φ(X1, ..., Xn) = φ(X_{π1}, ..., X_{πn})   (8.1)

for every permutation π1, ..., πn of 1, ..., n.
A procedure is permutation invariant if and only if it is a function of the
order statistics X(1), ... , X(n) alone. The order statistics form a sufficient
statistic if the X j are independent and identically distributed under all
contemplated distributions. (For test procedures, this must hold under both
alternative and null hypotheses.) Accordingly, for independent and identic-
ally distributed observations, it follows from the properties of sufficiency
(Sect. 3.3, Chap. 1) that given any procedure, an equivalent, permutation-


invariant procedure exists. This procedure (possibly randomized) is based
on the order statistics alone and has exactly the same operating charac-
teristics as the given procedure based on the observations and their order.
Thus, for independently, identically distributed observations, restricting
consideration to permutation-invariant procedures is clearly justified
because nothing can be gained by looking beyond them.
If the observations may have different distributions, such a strong jus-
tification of permutation invariance is not applicable because the order
statistics no longer form a sufficient statistic. However, another argument
may apply if it still seems unreasonable to take the order of the Xj into
account. If X1, ..., Xn provide intuitively the same relevant information
as any permutation X_{π1}, ..., X_{πn}, then a "reasonable" procedure φ would
satisfy φ = φ_π, where φ and φ_π denote the left- and right-hand sides of (8.1)
respectively. The argument, which we will call "invoking the principle of
invariance (for permutations)," asserts that a procedure φ satisfying φ = φ_π
"should" be used, or at least will be. It is just an assumption, and is based on
a rationale which may or may not apply in any particular problem.
To carry the argument a little further, suppose we have a procedure φ
where φ ≠ φ_π. If permuting the Xj would not affect our judgment of the
situation, then we would be indifferent between φ and φ_π for all permutations
π. Accordingly, we should also be indifferent between φ and the procedure
ψ which consists of choosing a permutation π at random and using φ_π.
Since the procedure ψ is permutation invariant (Problem 93), corresponding
to any procedure φ there is a permutation-invariant procedure ψ which
seems equally desirable. In this sense, nothing is lost by restricting considera-
tion to permutation-invariant procedures. This argument applies to pro-
cedures for testing, estimation, or anything else.
Now we discuss this argument from a slightly different point of view,
using the context of testing for convenience.
For any joint distribution F of X1, ..., Xn and any permutation π1, ..., πn,
let F_π be the joint distribution of X_{π1}, ..., X_{πn}. Suppose that for any null or
alternative distribution F, F_π is also a null or alternative distribution res-
pectively. Then the level of a test φ_π under a null distribution F is the same as
that of a test φ under the null distribution F_π, and the power of φ_π against an
alternative F is the same as that of φ against the alternative F_π. The power
function of φ_π is then the same as that of φ, except for a permutation of the
points in the space on which the power function is defined (the space of
alternative distributions). In addition, the exact (overall maximum) level of
φ_π is the same as the exact level of φ. If a power function remains equally
desirable when the alternative distributions are permuted in this way, and
the same is true for the level and the null distributions, then there is no
preference between φ and φ_π. If φ ≠ φ_π, the choice of one procedure over
the other must be arbitrary; a "reasonable" test φ will then satisfy φ = φ_π.
If this holds for all permutations π, a "reasonable" test will be permutation
invariant.

Consider the procedure ψ above which randomly selects a permutation
π and uses φ_π. Then ψ = Σ_π φ_π/n!. If each φ_π is as desirable as φ, then ψ is
as desirable as φ and is also a permutation-invariant test. One specific
sense in which ψ is as desirable as φ is that for any F, ψ and φ have the same
average power over the distributions obtainable from F by permutation,
that is

(1/n!) Σ_π E_{F_π}[ψ(X1, ..., Xn)] = (1/n!) Σ_π E_{F_π}[φ(X1, ..., Xn)].   (8.2)

When power is of interest only through such averages, the power of ψ is
as good as that of φ. It also follows from (8.2) that a suitable permutation
invariant test will have certain properties if any test has them. For instance,
if there exists a test which is uniformly most powerful, or uniformly most
powerful unbiased, then there is a permutation-invariant test which has the
same property.
The original argument was that in situations where we feel that any
rearrangement X_{π1}, ..., X_{πn} provides the same information as X1, ..., Xn,
we will want to use a procedure that treats them alike, that is, a permutation-
invariant procedure. The argument stated above in the context of testing
changes the emphasis slightly, by saying that if our attitudes toward F and F_π
are the same for all F and π, then any "reasonable" procedure is permutation
invariant. Two ways in which our attitudes toward F and F_π might differ
should be distinguished. First, we might consider an alternative F more
likely than F_π, and therefore prefer high power against F and low power
against F_π to the reverse. Alternatively, the consequences of an error might
be more severe under F than F_π, so that the power against F is again more
important than the power against F_π. In either case, it might be quite rea-
sonable to prefer some procedure which is not permutation invariant to
any procedure which is. In a more formal framework where such things can
be discussed explicitly, it is appropriate to say that we will be led to per-
mutation-invariant procedures if the prior distribution and loss structure
are both permutation invariant, but generally not otherwise. In any frame-
work, the mere fact that permutations carry null into null and alternative
into alternative distributions is a necessary condition, but by no means a
sufficient reason, to invoke the principle of invariance.

8.2 Invariance under Increasing, Odd Transformations

In the previous subsection, the notion of invariance was discussed
specifically in the context of transformations which are permutations of the ob-
servations. We gave two possible reasons for requiring a permutation-
invariant procedure, namely, the fact that the order statistics are sufficient
when the observations are independent and identically distributed, and the
"principle of invariance." Permutation invariance is, however, a property
of all procedures which are ordinarily considered for the situation at hand.
One justification for restricting consideration to signed-rank procedures is
based on another kind of invariance. The notion of invariance applies very
generally, with similar rationale and limitations, and the "principle of
invariance" can be invoked for any suitable class of transformations. In
this subsection we consider a large class of transformations which leads to a
much reduced and very useful class of invariant procedures, in particular,
to signed-rank tests.
Suppose, for convenience, that X1, ..., Xn are independent and identically
distributed, and that we are testing the null hypothesis that the distribution
is symmetric about 0 against the alternative that it is not. Consider the class
of transformations defined by all strictly increasing, odd functions g, where
odd means that
g( -x) = -g(x) for all x. (8.3)
Then if X1, ..., Xn satisfy the null hypothesis, so also do g(X1), ..., g(Xn),
and similarly for the alternative hypothesis. The transformation in (8.3)
then carries null distributions into null distributions and alternative
distributions into alternative distributions (Problem 97). Accordingly, we
could "invoke the principle of invariance," that is, require that the test
treat X1, ..., Xn and g(X1), ..., g(Xn) in the same way. If this is required
for all strictly increasing, odd functions g, then any two sets of observations
with the same signed ranks must be treated alike, because any set of observa-
tions can be carried into any other set with the same signed ranks by such a
function g (Problem 98). In short, the signed-rank tests are the only tests
which, for these hypotheses, are invariant under all strictly increasing, odd
transformations g.
The signed-rank tests are also invariant under this class of transformations
for some other null and alternative hypotheses we have been considering in
this chapter. The argument above applies to any hypotheses for which
strictly increasing, odd transformations carry null distributions into null
distributions and alternatives into alternatives. This restriction is satisfied
for the null hypothesis that the Xj are independent with possibly different
distributions but all symmetric about 0, and for null and alternative hypo-
theses of the form P(Xj < 0) > P(Xj > 0) or the reverse (as in Sect. 6.1,
Chap. 2), etc. However, it does not hold for alternatives under which the
Xj are symmetrically distributed about a value other than 0 (Problem 101).
Similarly, the confidence procedures for the center of symmetry μ which
correspond to signed-rank tests are not justifiable by this invariance argu-
ment, because they are not invariant under all strictly increasing, odd
transformations. They are not themselves signed-rank procedures, that is,
they are not functions of the signed ranks of the original observations. The
relevant transformations are different for different μ.
The argument for invariance under the class of transformations in (8.3)
is far less compelling than the argument for permutation invariance. On
general grounds, when the class of transformations is too large, as it is here,
it may not be possible to average over it as was done in (8.2). When this
occurs, some test may have optimum properties which no invariant test has.
It can even happen that there is a uniformly most powerful invariant test
which is seriously inadmissible, that is, there exists a noninvariant test which
is uniformly as good and considerably better in a large region [Lehmann,
1959, p. 231]. Thus it would seem that requiring invariance can lead (though
here it does not) to the use of a highly inferior procedure when the class of
transformations is too large. This does not fundamentally invalidate the
"principle of invariance" however, because one's attitude is never literally
invariant in such a situation. The point is that if one is going to use an in-
variant procedure when one has an only approximately invariant attitude,
then one should make sure that there is no non-invariant procedure available
which is significantly better.
This brings us to the real reason why the argument for invariance is not
compelling in the present situation. One might not want to treat X1, ...,
Xn and g(X1), ..., g(Xn) alike in all instances. For an extreme example,
there is a strictly increasing, odd function g which carries the observations
-1.2, -1.1, -1.0, -0.9, 7.6, 16.7, 24.1, 42.9, 51.0, 83.4
into
-12, -11, -10, -9, 13, 14, 15, 16, 17, 18.
However, one does not feel compelled to regard the two samples as providing
the same evidence concerning the null hypothesis of symmetry about 0;
the first might well be considered much less compatible with this null
hypothesis than the second. (For this it is not necessary to have measurements
on an "interval scale." While a difference of one unit might have varying or
indefinite meaning throughout the measurement scale, a difference of 20
might always be bigger than a difference of 1.)
If such a discrepancy is at all likely, one might prefer not to use a signed-
rank test. However, ordinarily one is quite content to treat alike practically
all samples which have the same signed ranks, with exceptions having small
probability of occurring. Then presumably little or nothing is lost by using a
signed-rank test, and there are indeed some very good tests of this type. By the
"principle of invariance," we can thus justify restricting consideration to
the class of signed-rank tests. Of course, the choice of which signed-rank test
remains.

9 Locally Most Powerful Signed-Rank Tests


This section is concerned with most powerful signed-rank tests. Surprisingly,
problems which admit uniformly most powerful tests do not generally
admit uniformly most powerful signed-rank tests. We can, however, find
that signed-rank test which is most powerful against any alternative dis-
tribution, and in particular against alternatives of various kinds which are
"close to" the null distribution. We assume that the observations are
independently, identically and continuously distributed so that ties have
probability zero. Also, by sufficiency, we can ignore the order of the observa-
tions and restrict consideration to permutation-invariant tests (Sect. 8.1).
If we did not, they would result anyway. This and some other points which
arise here will be discussed more fully in Chap. 5.
For a sample of size n, there are 2^n possible assignments of signs to the
ranks 1, ..., n, and accordingly 2^n sets of signed ranks r1, ..., rn, where
rj = ±j. We assume as usual that all 2^n possible sets of signed ranks are
equally likely under the null hypothesis. By the Neyman-Pearson Lemma
(Theorem 7.1 of Chap. 1), it follows (Problem 101) that among signed-rank
tests at level α, the most powerful test against the alternative F rejects if the
probability under F of the observed set of signed ranks is greater than a
constant k, and "accepts" if it is less than k, where k and the probability of
rejection at k are chosen to make the level exactly α. Letting P_F(r1, ..., rn)
be the probability under F of signed ranks r1, ..., rn, we may express this
test as

reject if P_F(r1, ..., rn) > k
                                          (9.1)
"accept" if P_F(r1, ..., rn) < k.

The most powerful signed-rank test against the alternative F depends,
of course, on F. Even if we restrict consideration to normal alternatives
with positive mean μ, the most powerful signed-rank test depends on μ.
If we consider only small positive μ, however, we will find that there is a cer-
tain signed-rank test which, among signed-rank tests, is uniformly most
powerful against normal alternatives with sufficiently small, positive μ.
Such a test is called locally most powerful against normal alternatives with
μ > 0.
More generally, consider a one-parameter family of distributions F_θ with
densities f_θ, and suppose that θ = 0 satisfies the null hypothesis of symmetry
about 0, that is, f_0(-x) = f_0(x). Let s_j denote the sign of r_j. Then we can
write the probability under F_θ of signed ranks r1, ..., rn (Problem 102) as

P_θ(r1, ..., rn) = n! ∫···∫_{0<y1<···<yn<∞} ∏_{j=1}^n f_θ(s_j y_j) dy1 ··· dyn.   (9.2)

Assume it is legitimate to differentiate (9.2) under the integral sign, and let

h(x) = [∂ log f_θ(x)/∂θ]_{θ=0}.   (9.3)
Then the derivative of (9.2) at θ = 0 is

[∂P_θ(r1, ..., rn)/∂θ]_{θ=0} = 2^{-n} Σ_{j=1}^n E_0[h(s_j |X|(j))],   (9.4)

where |X|(1) < ··· < |X|(n) in (9.4) are the absolute values of a sample of n
from F_0, arranged in order of size. Expanding P_θ(r1, ..., rn) in a Taylor's
series about θ = 0 and using (9.4) for its derivative, we have

P_θ(r1, ..., rn) = 2^{-n}{1 + θ Σ_{j=1}^n E_0[h(s_j |X|(j))] + smaller order terms}.
                                                                           (9.5)
Substituting this in (9.1), it follows that, for sufficiently small θ > 0,
the most powerful signed-rank test has the following form:

reject if Σ_{j=1}^n E_0[h(s_j |X|(j))] > k,
                                                    (9.6)
"accept" if Σ_{j=1}^n E_0[h(s_j |X|(j))] < k.

This is equivalent to a test based on a sum of signed constants c_j (Problem
104) where

c_j = E_0[h(|X|(j)) - h(-|X|(j))].   (9.7)
It follows that a test of the form

reject if Σ_{j=1}^n s_j c_j ≥ k,
                                      (9.8)
"accept" otherwise

is locally most powerful against F_θ, θ > 0, among signed-rank tests at the
same level α. If the level α desired is not attainable with a test of the form
(9.8), then randomization may be necessary at k, and if also more than
one set of signed ranks gives the critical value of Σ_1^n s_j c_j, then higher order
terms in (9.5) will be required to determine the locally most powerful test
(but not to maximize the derivative of the power at θ = 0).
Similar statements hold for θ < 0, with rejection when Σ_1^n s_j c_j is too small.
For normal alternatives with mean μ = θ and fixed variance σ², from
(9.3) we have (Problem 105) h(x) = x/σ² and thus

c_j = 2E(|X|(j))/σ²   (9.9)

where |X|(1) < ··· < |X|(n) are the ordered absolute values of a sample of
n from the normal distribution with mean 0 and variance σ². A test of the
form (9.8) based on these c_j is equivalent to one based on

c_j = E(|Z|(j))   (9.10)
where |Z|(1) < ··· < |Z|(n) are the ordered absolute values of a sample of
n from the standard normal distribution. Accordingly, this test is the locally
most powerful signed-rank test against normal alternatives with positive
mean. The corresponding lower-tailed test is similarly the locally most
powerful signed-rank test against normal alternatives with negative mean.
The test with the c_j in (9.9) (or, equivalently, (9.10)) is frequently referred
to as the Fraser (normal scores) test since it was derived by Fraser [1957a].
Additional properties, as well as the values of the scores in (9.10) and the
critical values, are given in Klotz [1963]. This test is asymptotically equi-
valent to a test with "inverse normal scores" as constants, that is, c_j =
Φ^{-1}[1/2 + j/(2(n + 1))], where Φ(x) is the standard normal c.d.f. The values of
these scores are more readily accessible than those in (9.10); e.g., Fisher and
Yates [1963] and van der Waerden and Nievergelt [1956]. This test is mentioned
in van Eeden [1963].
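As an illustration (ours, not the book's; it assumes Python 3.8+ for statistics.NormalDist, and the names are ours), the sketch below computes the approximate inverse-normal scores just described and the resulting sum of signed scores for the data of Sect. 7.1.

```python
from statistics import NormalDist

def inverse_normal_scores(n):
    """Approximate scores c_j ~ E(|Z|_(j)) via Phi^{-1}(1/2 + j/(2(n+1))), j = 1, ..., n."""
    return [NormalDist().inv_cdf(0.5 + j / (2 * (n + 1))) for j in range(1, n + 1)]

def normal_scores_statistic(x):
    """Sum of signed scores: the score for rank k gets the sign of the
    observation whose absolute value has rank k."""
    n = len(x)
    c = inverse_normal_scores(n)
    order = sorted(range(n), key=lambda i: abs(x[i]))
    return sum((1 if x[i] > 0 else -1) * c[k] for k, i in enumerate(order))

data = [49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48]
print(round(normal_scores_statistic(data), 3))
```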

[Figure 9.1: graphs of the c.d.f. F(x) and the density f(x) of the logistic distribution discussed below.]
Consider next the logistic distribution with

F_θ(x) = 1/[1 + e^{-(x-θ)}],   f_θ(x) = e^{-(x-θ)}/[1 + e^{-(x-θ)}]².   (9.11)

This distribution is shown in Fig. 9.1; it is very close to a normal distribution.
For the logistic distribution, we have (Problem 105) h(x) = 2F_0(x) - 1 and

c_j = 2j/(n + 1).   (9.12)
The tests of the form (9.8) based on the c_j in (9.12) are exactly the upper-
tailed (nonrandomized) Wilcoxon signed-rank tests, which are therefore
locally most powerful among signed-rank tests against the logistic alternative
distributions in (9.11) with θ > 0. The lower-tailed test has a similar
property for θ < 0. An arbitrary scale factor σ would not alter this property.
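Relation (9.12) is also easy to check numerically; the Monte Carlo sketch below (ours, assuming Python; the names and the number of replications are illustrative) estimates c_j = E_0[h(|X|(j)) - h(-|X|(j))] from logistic samples and compares it with 2j/(n + 1).

```python
import math, random

def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))

def simulated_logistic_scores(n=8, reps=50000, seed=1):
    """Monte Carlo estimate of c_j = E_0[h(|X|_(j)) - h(-|X|_(j))] with
    h(x) = 2 F_0(x) - 1, sampling from the standard logistic distribution."""
    rng = random.Random(seed)
    totals = [0.0] * n
    for _ in range(reps):
        sample = [math.log(u / (1 - u)) for u in (rng.random() for _ in range(n))]
        for j, a in enumerate(sorted(abs(v) for v in sample)):
            totals[j] += (2 * logistic_cdf(a) - 1) - (2 * logistic_cdf(-a) - 1)
    return [t / reps for t in totals]

n = 8
print([round(c, 2) for c in simulated_logistic_scores(n)])
print([round(2 * j / (n + 1), 2) for j in range(1, n + 1)])   # 2j/(n+1)
```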
The argument leading to (9.8) shows that any locally most powerful
signed-rank test is a one-tailed test based on a sum of signed constants,
subject to the qualification following (9.8). It can also be shown that any
test of the form (9.8) is locally most powerful against some alternative
(Problem 96, Chap. 5). A more difficult problem is to determine which c_j
provide locally most powerful tests for some restricted alternatives like
F_θ(x) = F_0(x - θ) or F_θ(x) < F_0(x) for θ > 0. This problem will not be
discussed here, but see Problem 103, Chap. 5.

PROBLEMS
1. Show that a distribution with discrete frequency function f is symmetric as defined
by (2.1) or (2.2) if and only if f satisfies (2.3).
2. (a) Show that the condition in (2.1) is equivalent to the same condition with
strict inequalities, that is, P(X < μ - x) = P(X > μ + x) for all x.
(b) Show that (2.1) holds for all x if and only if it holds for all x > 0.
3. Suppose that X is symmetrically distributed about μ. Show that
(a) μ is a median of X.
(b) μ is the mean of X (provided it exists).
(c) μ is the mode of X if X is unimodal.
4. Show that each of the symmetry conditions given as (a), (b), and (c) in Sect. 2 is
equivalent to the condition that X is symmetrically distributed about μ.
*5. Show that a distribution is symmetric about 0 if and only if its characteristic
function is real everywhere. (Hint: Each is equivalent to the statement that X and
- X have the same distribution.)

6. Let V and W have the joint density
f(v, w) = 1/2   for -1 ≤ w ≤ v ≤ 1, v - w ≤ 1;  or  0 ≤ v + 1 ≤ w ≤ 1.

(a) Show that the marginal distributions of V and W are both the uniform density
over (-1, 1), and hence the medians of V and W are each equal to 0.
(b) Show that P(W < V) = 3/4, which implies that the median of the population
of differences X = W - V must be negative and hence cannot equal the
difference between the medians of W and V.
(c) Show that the density of the difference X = W - V is
f(x) = (2 + x)/2 for -1 < x ≤ 0,   f(x) = (2 - x)/2 for 1 < x ≤ 2.
This distribution is not symmetric and does not have median 0. It has a unique
median -2 + √3.
7. Suppose that V and W are independent and X = W - V. Show that the following
properties hold. Note that (a)-(c) give conditions under which the difference X
is symmetrically distributed about a median which equals the difference of the
medians of W and V, while (d) shows that, even if X is symmetrically distributed,
its center of symmetry may not be equal to the difference of the medians of W and V.
(a) If V and W are symmetrically distributed about μ and λ respectively, then X is
symmetrically distributed about λ - μ.
(b) If V and W are identically distributed, then X is symmetrically distributed
about 0.
(c) If W has the same distribution as V + θ for some "shift" θ, then X is sym-
metrically distributed about θ.
*(d) If V and W have any two distributions of the following family, then X is
symmetrically distributed about 0, even though the medians of V and W may
differ. The family of distributions is discrete with frequency functions f_θ given
by
f_θ(1) = 2θ,   f_θ(2) = 2/3 - 3θ,   f_θ(3) = 1/3,   f_θ(4) = θ,
all for 0 ≤ θ ≤ 2/9. The median is uniquely 2 for θ < 1/6, uniquely 3 for θ > 1/6.
What is the median for θ = 1/6?
8. Define the difference X = W - V where W and V need not be independent. The
example in Problem 6 shows that in general, the median of X need not be equal
to the difference of the medians of W and V even if the marginal distributions
of V and W are identical and symmetric. Show that in the following situations,
the median of the difference is equal to the difference of the individual medians,
by showing first the results stated.
(a) If (V, W) has the same distribution as either (W, V) or (-V, -W), then X is
symmetrically distributed about 0 (and the medians of V and W are equal).
(b) If the distributions of V, W, and X are each symmetric (with finite means), then
the center of symmetry of X is equal to the difference of the centers of symmetry
of W and V.
9. For a set of n independent observations from a population which is continuous and
symmetric about zero, show that the signs of the signed ranks are mutually inde-
pendent and each is equally likely to be positive or negative. Show that the signed
ranks themselves are dependent if the original order of the observations is retained.
10. Show that T+ and T- have identical null distributions.
11. (a) If un(t) denotes the number of subsets of the first n integers whose sum is equal
to t, show that
u_n(t) = u_{n-1}(t - n) + u_{n-1}(t)
for all t = 0, 1, ..., n(n + 1)/2 and all positive n, with the following initial
and boundary conditions:
u_n(t) = 0 for all t < 0
u_0(0) = 1
u_0(t) = 0 for all t ≠ 0
u_n(t) = 0 for t > n(n + 1)/2.
This provides a simple recursive method for generating the frequencies of
values of T+ and hence the null probability function p_n(t) = P(T+ = t) for
samples of size n using

2p_n(t) = p_{n-1}(t - n) + p_{n-1}(t).


(b) What change is required in order to generate directly the null cumulative
probabilities F_n(t) = P(T+ ≤ t)?

12. Use the recursive method developed in Problem 11 to generate the complete
null distribution of T+ for all n ≤ 4. Check your results against Table D.
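A minimal computational sketch of the recursion in Problem 11(a) (ours, in Python; the function name is ours), which can also be used to check the small cases requested in Problem 12:

```python
def subset_sum_counts(n):
    """u_n(t), t = 0, ..., n(n+1)/2: the number of subsets of {1, ..., n} with
    sum t, built by u_n(t) = u_{n-1}(t - n) + u_{n-1}(t).  Dividing by 2^n
    gives the null probability function of T+."""
    u = [1]                                # u_0: only the empty set, with sum 0
    for m in range(1, n + 1):
        new = [0] * (m * (m + 1) // 2 + 1)
        for t in range(len(new)):
            new[t] = (u[t] if t < len(u) else 0) + \
                     (u[t - m] if 0 <= t - m < len(u) else 0)
        u = new
    return u

for n in range(1, 5):
    freqs = subset_sum_counts(n)
    print(n, freqs, [round(f / 2 ** n, 4) for f in freqs])
```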

*13. Define u_n(t) as in Problem 11 and let u(t) be the number of subsets of all positive
integers whose sum is equal to t. Show that
(a) u_n(t) = u(t) for t ≤ n.
(b) The number of subsets of all positive integers with sum t and maximum m
is u_{m-1}(t - m).
(c) u_n(t) = u(t) - Σ_{i≥0} u_{n+i}(t - n - 1 - i), where the sum actually terminates
because the summand vanishes for i > t - n - 1. (Hint: The terms in the
sum count the subsets with sum t and maximum n + 1 + i.)
(d) u_n(t) = u(t) - u^(1)(t - n - 1) for t ≤ 2n + 1, where u^(1)(s) = Σ_{i≥0} u(s - i).
(e) u_n(t) = u(t) - u^(1)(t - n - 1) + Σ_{i≥0} Σ_{j≥0} u_{n+i+j}(t - 2n - 2 - 2i - j).
(f) u_n(t) = u(t) - u^(1)(t - n - 1) + u^(2)(t - 2n - 2) for t ≤ 3n + 2, where u^(2)(s)
= Σ_{i≥0} u^(1)(s - 2i).
(g) u_n(t) = Σ_{k≥0} (-1)^k u^(k)(t - kn - k), where u^(0)(s) = u(s) and

u^(k)(s) = Σ_{i≥0} u^(k-1)(s - ki).

Both sums terminate because u^(k)(s) = 0 for s < 0. (For a complete, formal
proof, it may be convenient to introduce u_n^(k)(t) = Σ_{i≥0} u_{n+i}^(k-1)(t - ki) and
prove by induction on k that u_n^(k)(t) = u^(k)(t) - u_n^(k+1)(t - n - 1).)
(h) All of the foregoing equalities hold for U_n(t) = Σ_{i=0}^t u_n(i) = the number of
subsets of {1, 2, ..., n} with sum at most t, if u^(k) is replaced by U^(k), where
U^(k)(t) = Σ_{i≥0} U^(k-1)(t - ki) and U^(0)(t) = U(t) = Σ_{i=0}^t u(i) = u^(1)(t). How
is (b) interpreted in this case?
(i) The null probability function and c.d.f. of T+ satisfy

P_n(T+ = t) = 2^{-n} Σ_{k=0}^∞ (-1)^k u^(k)(t - kn - k),

P_n(T+ ≤ t) = 2^{-n} Σ_{k=0}^∞ (-1)^k U^(k)(t - kn - k).
Note: Instead of tabulating the null distribution directly, one could tabulate the
functions u^(k) (for point probabilities) or U^(k) (for tail probabilities). The total
number of lower tail probabilities less than 0.5 for sample sizes 1, ..., n is
[[n(n + 1)(n + 2)/12]] - [[(n + 2)/4]], where [[x]] denotes the largest
integer not exceeding x. The number of function values U^(k)(t) required to
cover the same range is [[(n + 4)/4]]{[[n(n + 1)/4]] - [[n/4]](n + 1)/2}.
For large n, this is about 3/8 as large a number, and the values of t covered are
covered for all sample sizes. The tabulation required could be reduced still further by
omitting U^(k) for alternate values of k (odd or even) and using U^(k-1)(s) =
U^(k)(s) - U^(k)(s - k). Tables of the functions u^(k) and U^(k) are easily generated
for successive k from their definitions and a table of u. The function u is a well-
studied partition function and is tabled in National Bureau of Standards
[1964, Table 24.5] where further references may be found. It can be generated
recursively (without need for u_n(t)) from the nontrivial relation

u(t) = Σ_{k≥1} (-1)^{k-1} [u(t - (3k² - k)/2) + u(t - (3k² + k)/2)] + s(t)

where u(t) = 0 for t < 0, u(0) = 1, and s(t) = (-1)^r if t = 3r² ± r for some
integer r, s(t) = 0 otherwise. Alternatively, for tail probabilities, U(t) can be
generated recursively from

U(t) = Σ_{k≥1} (-1)^{k-1} [U(t - (3k² - k)/2) + U(t - (3k² + k)/2)] + S(t)

where U(t) = 0 for t < 0, U(0) = 1, and S(t) = (-1)^r if 3r² + r ≤ t <
3(r + 1)² - r - 1 for some integer r, S(t) = 0 otherwise. This relation follows
from the previous one.
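The relations quoted in this Note can be checked numerically; the sketch below (ours, in Python; the names are ours) computes u(t) both by direct counting of partitions into distinct parts and by the recursive relation for u(t), and compares the two.

```python
def u_direct(tmax):
    """u(t) for t = 0, ..., tmax by dynamic programming: each part 1, 2, ...
    may be used at most once (partitions into distinct parts)."""
    u = [1] + [0] * tmax
    for part in range(1, tmax + 1):
        for t in range(tmax, part - 1, -1):
            u[t] += u[t - part]
    return u

def u_from_relation(tmax):
    """u(t) from the recursive relation quoted above, with
    s(t) = (-1)^r if t = 3r^2 - r or 3r^2 + r, and 0 otherwise."""
    u = [0] * (tmax + 1)
    for t in range(tmax + 1):
        total, r = 0, 0
        while 3 * r * r - r <= t:              # the correction term s(t)
            if t in (3 * r * r - r, 3 * r * r + r):
                total = (-1) ** r
            r += 1
        k = 1
        while (3 * k * k - k) // 2 <= t:       # the alternating sum over k
            sign = (-1) ** (k - 1)
            total += sign * u[t - (3 * k * k - k) // 2]
            if (3 * k * k + k) // 2 <= t:
                total += sign * u[t - (3 * k * k + k) // 2]
            k += 1
        u[t] = total
    return u

print(u_direct(30) == u_from_relation(30))   # True
```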
14. Derive the mean and variance of T as given in (3.7) and (3.8) by using the fact
that under the relevant null hypothesis, T is a sum of the first n integers with factors
+1 and -1 attached at random.
15. Show that T+, T-, and T have symmetric null distributions.
16. Show that for samples with n nonzero observations and no ties, the null probability
distribution of T+ can be written as

P(T+ = t) = Σ_{k=0}^n (n choose k) 2^{-n} P(T+ = t | S = k)
where S denotes the number of positive observations. This representation might be
useful for systematic generation of the null distribution of the signed-rank statistic.
Further, P(T+ = t | S = k) is a null probability for the Wilcoxon two-sample
statistic (covered in Chap. 5) where the positive observations are interpreted as
from one sample and the negative from another. Problem 17 gives further insight
into the relationship between the one-sample and two-sample statistics.
17. Let D_1, ..., D_N be a sample of N nonzero observations and define X_i or Y_i for each
i by
D_i = X_i if D_i > 0,   D_i = -Y_i if D_i < 0.
Assume there are m X values and n Y values, where m + n = N.


(a) Show that the signed-rank test statistic T+ calculated for these D_i is equal to
the sum of the ranks of the X observations in the combined ordered sample of
m X's and n Y's.
(b) Show that T+ - T- is the sum of the ranks of the X's minus the sum of ranks
of the Y's in the combined ordered sample. The sum of the ranks of the X's
is the test criterion for the Wilcoxon two-sample statistic to be discussed in
Sect. 4, Chap. 5. Show that T+ might be used to test the null hypothesis that
the distributions of X and Y are identical, and relate this to a test based on T+
for the null hypothesis that the center of symmetry of the D population equals 0.
18. (a) Prove Theorem 3.1, concerning the relation between Walsh averages and
signed ranks.
(b) Show that, if there are no zeros or ties, T = T+ - T- can be written

Σ_{1 ≤ i ≤ j ≤ n} sign(X_i + X_j),

where

sign(X) = 1 if X > 0,    sign(X) = -1 if X < 0.

19. (a) Show that the possible values of T = T+ - T- are alternate integers between
-n(n + 1)/2 and n(n + 1)/2 inclusive.
(b) For what values of n are the possible values of T even?
20. (a) Show that the continuity correction to the approximate normal deviate for
the Wilcoxon test is [6/n(n + 1)(2n + 1)]^(1/2).
(b) Show that this correction is less than 0.02 if (and only if) n ≥ 20, less than 0.01
if (and only if) n ≥ 31.
(c) Show that the corresponding values for the sign test (of an hypothesized
median) are 1/n^(1/2), 2501, and 10001.
*21. Show that T+ and T- are asymptotically normal by using the fact that T is
asymptotically normal.
*22. Show that L'i R J , as defined in Sect. 3.2, satisfies the Liapounov criterion.

23. Show that the standardized statistics [T+ - E(T+)]/√var(T+) and [T - E(T)]/√var(T)
are identical in value as long as the means and variances are
calculated under the same distribution.
24. Verify the moments of the T_ij which are given in the proof of Theorem 3.2.
25. Under the null hypothesis of symmetry about 0,
(a) Show that the probabilities defined in Theorem 3.2 have the values

p_1 = 1/2, p_2 = 1/2, p_3 = 3/8, p_4 = 1/3.
(b) Verify that the expressions given in (3.12) and (3.15) for the mean and variance
of T+ reduce correctly to (3.5) and (3.6).
26. Use the method of Problem 1 of Chap. 1 to show that 2T+/n^2 is a consistent
estimator of p_2.

27. (a) Consider the sign test of Chap. 2 for an hypothesized median based on an
independent random sample. Against what alternatives is it consistent?
(b) Give an example of an alternative against which the sign test is consistent but
the Wilcoxon test is not, and vice versa.
28. Show that the Wilcoxon test based on an independent random sample from a
symmetric distribution is consistent against shift alternatives.
*29. Suppose that |E(Z_n)| ≤ B and var(Z_n) ≤ B for all n and all null distributions.
Use Chebyshev's inequality to show that an equal-tailed test based on Z_n is con-
sistent against any alternative under which |E(Z_n)| → ∞ and var(Z_n) is bounded.
30. Show that if X_1, ..., X_n are independent with distributions which are continuous
and symmetric about μ_0, then a Wilcoxon test for center of symmetry μ_0 has the
same level irrespective of whether the distributions are identical or not.
31. If the conditional distribution of X_1 given X_2, ..., X_n is symmetric about 0,
show that the conditional probability that X_1 > 0 equals the conditional prob-
ability that X_1 < 0 given the signs of X_2, ..., X_n.

32. Consider a matched-pairs experiment with the null hypothesis that the treatment
actually has no effect. Show that randomization validates the null distribution
of the Wilcoxon test statistic defined on treatment-control differences.
33. (a) Let X have continuous c.d.f. F and let G(x) = 1/2 + (1/2)[F(x) - F(-x)]. Show
that G is the c.d.f. of a symmetric distribution and is stochastically larger than
F if and only if P(X < -x) ≥ P(X > x) for all x ≥ 0.
(b) Generalize (a) to allow discrete distributions.
*34. (a) If X has c.d.f. F and F(x) ≥ G(x) for all x, show that there exists a random
variable Y with c.d.f. G such that P(Y ≥ X) = 1.
(b) If X_1, ..., X_n are independent, X_j has c.d.f. F_j and F_j(x) ≥ G_j(x) for all x and j,
show that there exist independent random variables Y_j such that Y_j has c.d.f.
G_j and P(Y_j ≥ X_j) = 1 for all j.

35. Use Theorem 3.3 to show that a suitable one-tailed Wilcoxon test rejects with
probability at most α under (3.21) and at least α under (3.22).
*36. Let X_1, ..., X_n be independent observations on a continuous distribution which is
symmetric about μ.
(a) Show that for any n ≥ 3, we have

P(X_(1) > μ) = 2^(-n)

P[(X_(1) + X_(2))/2 > μ] = 2(2^(-n))

P[X_(2) > μ and (X_(1) + X_(3))/2 > μ] = 3(2^(-n)).

These results give lower confidence bounds at the respective confidence levels
1 - 2^(-n), 1 - 2^(-(n-1)), 1 - 3(2^(-n)) for any sample of size n ≥ 3.
(b) Show that these confidence bounds correspond to the Wilcoxon test with
critical values 0, 1, 2 respectively.

*37. Consider the Walsh averages W_ij = (X_(i) + X_(j))/2, for i ≤ j, defined in terms of
the order statistics X_(1), ..., X_(n) of a set of n observations.

(a) Note that always W_11 ≤ W_12 ≤ all other W_ij. What other inequalities always
hold between W_ij with i ≤ j ≤ 4?
(b) Recall that the three smallest W_ij are W_11, W_12, and min{W_22, W_13}. Which
W_ij can be fourth smallest for some data sets? (There are three possibilities.)
(c) Show that the minimum possible rank of W_ij among the Walsh averages is
i(2j - i + 1)/2.
(d) Show that the maximum possible rank of W_1j among the Walsh averages
is j(j - 1)/2 + 1. What is the maximum possible rank of W_ij for i ≥ 2?
(e) Which W_ij can be fifth smallest for some data sets (four possibilities)? Sixth
smallest (six possibilities)?
(f) Show that the fourth smallest W_ij is min{W_14, max(W_22, W_13)} = max{W_13,
min(W_22, W_14)}.
(g) Show that the fifth smallest W_ij is min{W_15, W_23, max(W_14, W_22)} =
max{min(W_14, W_23), min(W_15, W_22)}.
(h) Show that the fourth and fifth smallest W_ij are confidence bounds correspond-
ing to one-sided Wilcoxon tests at levels α = 5/2^n and 7/2^n for n ≥ 5.
38. Show that the modified Wilcoxon test statistic T_0 of Sect. 5 has the same null
distribution in a sample of size n as T- in a sample of size n - 1.
39. For a sample with neither zeros nor ties, show that
(a) T+ = T_0^+ + S where S is the number of positive observations. (This result
relates the Wilcoxon statistic, the modified Wilcoxon statistic and the sign
test statistic.)
(b) T_0^+ is the number of positive Walsh averages (X_i + X_j)/2 with i < j.
40. Verify the expressions given in (5.1) and (5.2) for the mean and variance of T_0^+.
41. Show that 2T_0^+/n(n - 1) is the minimum variance unbiased estimator of p_2 for
the family of all continuous distributions.
42. With the definitions of Sect. 3.3, show that p_4 ≤ p_2 for all distributions.
*43. (a) Show that the suggestions following (5.3) lead to inequalities of the form
(p̂_2 - p_2)^2 ≤ C + 2Bp_2 - Ap_2^2 where p̂_2 = 2T_0^+/n(n - 1) and A, B, C are
nonnegative constants.
(b) Show that the corresponding confidence region is an interval with endpoints
{p̂_2 + B ± [(p̂_2 + B)^2 - (p̂_2^2 - C)(1 + A)]^(1/2)}/(1 + A),
except that it is empty if the quantity in brackets is negative (which is impos-
sible if p_4 is replaced by the upper bound p_2 and extremely unlikely if it is
estimated as described following (5.3)).
*44. Assuming that the asymptotic distribution of [T_0^+ - E(T_0^+)]/√var(T_0^+) is
standard normal, show that P{|T_0^+ - n(n - 1)p_2/2| ≤ z√V} → 1 - 2α if z is
the upper α point of the standard normal distribution and V is an estimator of
var(T_0^+) satisfying V/n^3 → p_4 - p_2^2 in probability. This provides an asymptotic
confidence interval for p_2.
45. For the data in Sect. 2, verify the P-values, confidence bounds and confidence
levels given in Sect. 5 for the Wilcoxon and modified Wilcoxon procedures.
46. Show that T_0 = 0 if and only if T- ≤ 1. Thus the modified Wilcoxon test with
critical value 0 is equivalent to the ordinary Wilcoxon test with critical value 1.
(Otherwise the tests are not equivalent.)

*47. Modifying Problem 37, consider only those Walsh averages W_ij with i < j.
(a) Show that the smallest three W_ij with i < j are W_12, W_13 and min{W_23, W_14}.
(b) Show that the minimum and maximum possible ranks of the W_ij among
those with i < j are the same as those of W_{i,j-1} in Problem 37.
(c) Show that the formulas for the ordered W_ij in Problem 37 apply here if the
second subscript is increased by 1 throughout. For instance, (a) is so related
to 37(b). Similarly 37(f) gives that the fourth smallest W_ij among those with
i < j is min{W_15, max(W_23, W_14)} = max{W_14, min(W_23, W_15)}.
(d) Show that the first five ordered W_ij with i < j are confidence bounds cor-
responding to one-sided modified Wilcoxon tests at levels 2/2^n, 4/2^n, 6/2^n, 10/2^n,
and 14/2^n respectively for n ≥ 6. In particular, T_0 ≤ 0, 1, or 2 respectively if
and only if 0 < W_12, W_13, or min{W_23, W_14}.
48. For the data in Sect. 2 use procedures corresponding to the tests based on T+
and T_0^+ to find upper and lower confidence bounds for μ, each at level approxi-
mately 0.025, by applying the methods of interpolation between attainable levels
(explained in Sect. 5, Chap. 2).
*49. In order to investigate the effect of interpolating halfway between two adjacent
order statistics of a random sample to find a confidence bound for the population
median μ, note that the true level of the interpolated confidence bound is

P[(X_(i) + X_(i+1))/2 > μ] = P(X_(i) > μ) + pP(X_(i) ≤ μ < X_(i+1))
= (1 - p)P(X_(i) > μ) + pP(X_(i+1) > μ)

where

p = P(X_(i+1) - μ > μ - X_(i) | X_(i) ≤ μ < X_(i+1)).

Linear interpolation approximates p by 1/2. Show that, for a continuous, symmetric
population,

p = P(R_1 = -1 | S- = i) = i/n

where R_1 is the first signed rank and S- is the number negative among the values
X_j - μ. Thus, linear interpolation overestimates the error probability (for one-
tail probabilities below 0.5).
*50. (a) In order to investigate the effect of interpolating between the two smallest
(or largest) Wilcoxon confidence limits, show that for n observations on a
density f,

P(r) = P[(1 - r)X_(1) + r(X_(1) + X_(2))/2 > 0]
= n(n - 1) ∫∫ f(x)f(y)[1 - F(y)]^(n-2) dx dy

where the region of integration is (1 - r/2)x + ry/2 > 0, x < y. Show that
for 0 ≤ r ≤ 1, this reduces to

n(n - 1) ∫_0^∞ f(y){F(y) - F(-ry/(2 - r))}[1 - F(y)]^(n-2) dy

= [1 - F(0)]^n for r = 0,
= 2/2^n for r = 1 and f symmetric about 0.

The accuracy of linear interpolation depends on how linear the integral is,
and hence on the behavior of F[-ry/(2 - r)], as a function of r, 0 ≤ r ≤ 1.
(b) Show that, for the uniform distribution, F[-ry/(2 - r)] is a concave function
of r in the relevant range, and hence P(r) is convex.
(c) Show that, for the standard normal distribution, F[-ry/(2 - r)] is a concave
function of r for 0 ≤ r ≤ 2 - y^2 and hence for 0 ≤ r ≤ 1 and 0 ≤ y ≤ 1.
(Values of y > 1 contribute relatively little to the integral above, since both
the first and last terms in the integrand decrease rapidly as y increases above 1.)
These results suggest that the tail probability P(r) tends to be convex in r and
hence to be overestimated by linear interpolation.
51. Show that, if L is the (k + 1)th smallest Walsh average of a sample from a distribu-
tion which is symmetric about μ, then P(L ≤ μ) ≥ 1 - α ≥ P(L < μ) where
1 - α is the exact confidence level in the continuous case. (Hint: What confidence
region corresponds to the randomization method of breaking ties?)

52. (a) Show that the null mean and variance of T+, based on n nonzero observations
with ties and calculated using the average rank procedure, conditional on the
ties observed, are

E(T+) = n(n + 1)/4

var(T+) = [n(n + 1)(2n + 1) - Σ t(t^2 - 1)/2]/24,

where t is the number of observations tied for any given rank, and the sum is
over all distinct sets of tied ranks for any t > 1. (Σ t(t^2 - 1) could be written
as Σ_i (t_i^2 - 1) where i ranges over all observations and t_i is the number of
observations tied with observation i, including itself.) The same result holds
if zeros are present if they are omitted and n is reduced accordingly.
(b) If zeros are included for the ranking but given signed-rank zero, show that the
null mean and variance are

E(T+) = n(n + 1)/4 - v(v + 1)/4

var(T+) = [n(n + 1)(2n + 1) - v(v + 1)(2v + 1) - Σ t(t^2 - 1)/2]/24

where v is the number of zeros and the sum is over all sets of nonzero ties.
(A short computational sketch of (a) and (b) follows this problem.)
(c) Show that the null distribution of T+ is symmetric.
(d) Show that T+ + T- = n(n + 1)/2 - v(v + 1)/2.
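
As a practical footnote to parts (a) and (b), the tie-corrected moments are straightforward to evaluate by machine. The sketch below (Python; the function and argument names are illustrative, and the code merely evaluates the formulas stated above rather than deriving them) returns the null mean and variance of T+ under either treatment of zeros.

```python
from collections import Counter

def signed_rank_null_moments(observations, zeros_ranked=False):
    # Evaluates the formulas of Problem 52: zeros omitted as in (a), or zeros ranked
    # but given signed rank zero as in (b). Ties enter only through the sum of t(t^2 - 1).
    abs_vals = [abs(x) for x in observations]
    if zeros_ranked:
        v = abs_vals.count(0)                       # zeros kept in the ranking
    else:
        abs_vals = [a for a in abs_vals if a != 0]  # zeros dropped, n reduced
        v = 0
    n = len(abs_vals)
    ties = Counter(a for a in abs_vals if a != 0)   # nonzero tied groups only
    tie_sum = sum(t * (t * t - 1) for t in ties.values() if t > 1)
    mean = n * (n + 1) / 4 - v * (v + 1) / 4
    var = (n * (n + 1) * (2 * n + 1) - v * (v + 1) * (2 * v + 1) - tie_sum / 2) / 24
    return mean, var
```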
53. List all possible ways of breaking ties for the data in (6.2) and verify Table 6.3.
54. Show that, under the null hypothesis of symmetry, the random method of breaking
ties preserves the usual null distribution of the Wilcoxon statistic (even condition-
ally on the absolute values observed).

55. Show that 0 is an endpoint of the usual Wilcoxon confidence interval if and only
if the ordinary Wilcoxon test would reject by one method of breaking the ties but
not by another.

56. Verify that the (k + 1)th smallest Walsh average is 0 for 7 ≤ k ≤ 11 for the data
in (6.2).
57. Verify the results given after Equation (6.5) for the left tail of the null distribution
of T- for n = 8 by the average rank procedure, given a tie at ranks 1,2,3, and 4.

58. For the data 1, 1, 1, 2, 2, 2, 2, -3, show that
(a) By the average rank procedure, the exact P-value is 25/2^8 and the next P-value
(the probability of a strictly smaller value of T-) is 24/2^8.
(b) However the ties are broken, the exact P-value is 25/2^8 and the next P-value is
19/2^8.
(c) For one-tailed α = 24/2^8, the sample is not significant by the average rank
procedure, but after any tiebreaking it would be significant with probability
5/6 by an exact randomized Wilcoxon test. (Thus this sample exemplifies the
reverse of (6.3).)
*(d) What are the corresponding confidence regions?
*59. Show that, if there are no nonzero ties, the signed-rank zero procedure satisfies
the conditions (i)-(iii) of Sect. 6.5. (Hint: Show that T_0 ≤ T- ≤ T_0 + v(v + 1)/2
where T_0 is obtained by the signed-rank zero procedure and T- by breaking ties
randomly and v is the number of zeros. Use this to relate the critical value of T_0 to
the usual critical values.)
60. Show that the average rank procedure in conjunction with the signed-rank zero
procedure satisfies the conditions (i)(b) and (ii) of Sect. 6.5. (Hint: Show that if all
observations are increased equally, the signed-rank sum of the actual sample
increases at least as much as that of the sample obtained by any reassignment of
signs.)
61. For the data 0, -1, -2, -3, -4, 5, 6, 7, 8, 9, 10, 11, show that
(a) By the reduced sample procedure, the null probability of a negative-rank
sum as small as or smaller than that observed is 43/2^11 = 0.0210.
(b) If the zero is retained and given the signed rank +1, the null probability of a
smaller negative-rank sum is 87/2^12 = 0.0212, which nevertheless exceeds the
value in (a).
*(c) For 43/2^11 ≤ α ≤ 87/2^12, by the reduced sample procedure, the sample is signi-
ficant in the positive direction, but becomes not significant when every observa-
tion is increased by the same small amount. The corresponding confidence
region for μ is an interval with the interior point 0 removed.
62. Show that for the requirements to avoid anomalies in the presence of zeros and
ties given in Sect. 6.5, condition (i)(b) holds if and only if condition (ii) holds.
*63. Show that, for the procedure of breaking the ties at random, either actually using
standard tables or reporting the probability with which doing so would lead to
rejection, conditions (i), (ii) and (iii) and also (i') of Sect. 6.5 hold.
64. In order to show that, despite intuition, larger observations are not always greater
evidence of positivity, consider the density

f(x) = 0.4 for -1 < x < 0,
f(x) = 0.8 - 0.4x for 0 ≤ x < 1,
f(x) = 0 otherwise.

(a) Show that this is a "positive" density, since f(x) > f(-x) for 0 < x < 1 and
f(x) ≥ f(-x) for all x > 0.
(b) Show that if a sample of size 2 is drawn from a population with this density,
the signed ranks 1, -2 are more likely than the signed ranks -1, 2.

65. (a) Show that increasing one observation in a sample may decrease the ordinary t
statistic.
(b) More generally, show that t is a decreasing function of X_1 for
X_1 > Σ_2^n X_i^2 / Σ_2^n X_i if Σ_2^n X_i > 0, and that t → 1 as X_1 → ∞.

*66. Show that, if the Wilcoxon statistic is computed for each possible way of breaking
the ties (as in Sect. 6.3), the simple average of the resulting values is numerically
equal to the statistic obtained by the average rank and signed-rank zero procedures.

67. For the data -1, -2, -3, -4, 5, 6, 7, 8, 9, 10, 11,
(a) Show that by the Wilcoxon signed-rank test the one-tailed P-value for this
sample is 43/2^11, so that the sample would be judged significantly positive at all
levels α ≥ 43/2^11.
(b) Suppose that an additional observation is obtained and its value is 0.5. Show
that the signed-rank procedure for the new sample gives a next P-value of
87/2^12. Hence this sample would lead to a conclusion of not significantly
positive for 43/2^11 = 86/2^12 ≤ α ≤ 87/2^12. In other words, the addition of a
positive observation to a significantly positive sample may make it not sig-
nificant by the Wilcoxon signed-rank test.
(c) Show that the ordinary t test, based on normal theory, gives a one-tailed
P-value of 0.0165 for the original data, and that an additional positive observa-
tion, if small enough, decreases the value of t and hence increases the P-value.
Thus the t test has the same property.

68. For the data 0, 0, -2, -3, -5, 6, 9, 11, 12, 15, 16, and the negative-rank sum as test
statistic,
(a) Show that, by the signed-rank zero procedure, the exact P-value is 23/2^9 and the
next P-value is 19/2^9.
(b) Show that, by the reduced sample procedure, the exact P-value is 14/2^9 and
the next P-value is 10/2^9.
(c) If the zeros are given signed ranks ±1, ±2, what are the possible P-values?
(d) Do the results in (c) agree or disagree with those in (a) and (b)?

*69. The modified Wilcoxon procedure of Sect. 5 agrees with the reduced sample
procedure for the data given in (6.7) and in Problem 61. Why is the modified
procedure not subject to the same objections?

*70. Construct examples showing that, for the modified Wilcoxon procedure of Sect. 5,
(a) If ties are handled by the average rank procedure, neither condition (i)(a)
nor (iii) of Sect. 6.5 need hold.
(b) If zeros are handled by the reduced sample procedure, none of the conditions
(i)-(iii) of Sect. 6.5 need hold.

71. Show that the Wilcoxon statistic, calculated using the average rank procedure and
including the zeros in the ranking but giving them signed-rank zero, can be written
as

T = Σ_{i ≤ j} sign(X_i + X_j)

where sign(0) = 0.



*72. Show that the one-tailed Wilcoxon test in the appropriate direction using the
average rank and signed-rank zero procedures for ties and zeros is consistent
against alternatives for which the X_i are independent and P(X_i + X_j > 0) -
P(X_i + X_j < 0) is at least some fixed amount for all i ≠ j, while the "conservative"
procedure is not.
73. A manufacturer of suntan lotion is testing a new formula to see whether it provides
more protection against sunburn than the old formula. Ten subjects are chosen.
The two types of lotion are applied to the back of each subject, one on each side,
randomly allocated. Each subject is then exposed to a controlled but intense
amount of sun. Degree of sunburn was measured for each side of each subject, with
the results shown below (higher numbers represent more severe sunburn).

Subject Old Formula New Formula

1 41 37
2 42 39
3 48 31
4 38 39
5 38 34
6 45 47
7 21 19
8 28 30
9 29 25
10 14 8

(a) Test the null hypothesis that the difference of degree of sunburn is symmetric-
ally distributed about 0, against the one-sided alternative that the new formula
is more effective than the old. Use a Wilcoxon test at level 0.05, handling ties by
the average rank procedure and using Table D as an approximation.
(b) Compute the exact P-value by generating the appropriate tail of the distribu-
tion using average ranks.
(c) Find the range of P-values which results when the ties are broken.
(d) Do (b) and (c) of this problem always lead to the same decision when α = 0.05?
Find the range of α for which the decisions are the same.
(e) Find a 90% upper confidence bound for the median difference assuming that
the distribution of differences is symmetric.
74. For the data given in Problem 73, use the sign test procedure of Chap. 2 to
(a) Find the P-value for testing the null hypothesis that the median difference is O.
(b) Find an upper confidence bound at level 0.90 for the median difference.
75. The Johnson Rod Mill Company produces steel rods. When the process is operat-
ing properly, the rods have a median length of 10 meters. A sample of 10 rods,
randomly selected from the production line, yielded the results listed below.
9.8, 10.0, 9.7, 9.9, 10.0, 10.0, 9.8, 9.7, 9.8, 9.9
Does the process seem to be operating properly? How would you recommend
handling the ties?

76. The Brighton Steel Works orders a certain size casting in large quantities. Before
the castings can be used, they must be machined to a specified tolerance. The
machining is either done by the company or is subcontracted, according to the
following decision rule:
"If average weight of casting exceeds 25 kilograms, subcontract the order for
machining. If average weight of castings is 25 kilograms or less, do not subcontract."
The company developed this decision rule in an effort to reduce costs, because the
weight of a casting is a good indication of the amount of machining that will be
necessary while the cost of subcontracting the castings is a function of the number
of castings to be machined rather than the amount of machining required by each
casting.
The following data are for a random sample taken from a lot of 100 castings.
Casting     1     2     3     4     5     6     7     8
Weight    24.3  25.8  25.4  24.8  25.2  25.1  25.5  24.6
(a) What decision is suggested by the Wilcoxon signed-rank test at level 0.05?
(b) What assumption of the Wilcoxon test is critical here?
(c) What do you think of this method of making a decision?

77. The manufacturers of Fusion, "a new toothpaste for the post-atomic age," hired
Anonymous Unlimited, an independent research organization, to test their
product. Anonymous Unlimited induced children to go to the dentist and have
their cavities counted and filled, and then to switch from their regular brand to
Fusion. A year later they went to the dentist again. Advertisements blared the
astounding news: 87.5% had fewer cavities.
The actual data were as follows:

Child number 1 2 3 4 5 6 7 8

Cavities using regular brand 4 2 6 4 3 4


Cavities using Fusion 2 3 3 202

Apply to these data the statistical methods you consider most applicable and
comment on your choice of methods. What conclusions can be drawn from the
experiment, under what assumptions, and with what reservations? How could the
experiment have been improved (without changing its scope greatly)? Be brief.
78. A sail-maker wanted to know whether the sails he makes of dacron are better than
the sails he makes of cotton. He made 5 suits of dacron sails and 5 suits of cotton
sails, all for a certain type of boat. He obtained 10 boats of this type, labeled A, B, ...,
J, and had them sail in two races. He picked 5 of the 10 boats at random; these 5
(they were A, C, E, G, and H) used dacron sails in the first race and cotton sails
in the second race. The other five (B, D, F, I, and J) used cotton sails in the first race
and dacron sails in the second. The order of finish in the first race was C, H, A, J,
B, E, I, F, G, D; in the second race it was A, B, H, J, I, C, D, F, E, G. Analyze these
results to shed light on the sail-maker's question.
79. Generalize the relations in (3.1)-(3.3) among T, T+, and T- so that they apply to
sums of all, positive, and negative signed constants c_k.

80. Represent the general test statistic based on the sum of signed constants as the
linear combination

T' = Σ_k c_k S_k

where S_k denotes the sign of the observation with rank k in absolute value. Under
the null hypothesis that the observations are independently, continuously dis-
tributed, symmetrically about 0, show that
(a) E(T') = 0
(b) var(T') = Σ_k c_k^2
(c) T' is symmetrically distributed about 0.
*81. Show that the signed ranks and the signs of the Walsh averages have the relation-
ship stated in Theorem 7.1.
*82. In a sample of size 10, suppose that the Walsh averages (X_i + X_j)/2 are negative
if both i and j are odd, and positive otherwise. What could be the signed ranks of
X_1, ..., X_10?
83. Show that, if c_{k+1} ≥ c_k ≥ 0 for all k, the test based on the sum of the signed c_k's
satisfies the hypothesis of Theorem 3.3 (increasing an observation never decreases
the probability of rejection φ(X_1, ..., X_n)).
*84. Consider n observations X_i such that there are no ties in the Walsh averages.
(a) Show that, as μ increases, the signed ranks of the centered observations
X_i - μ change only when μ equals some Walsh average (X_i + X_j)/2.
(b) Show that, in the situation described in (a), the only changes are as follows.
If i = j, the signed rank of X_i changes from 1 to -1. If i ≠ j and X_i < X_j, the
signed rank of X_i changes from -(k + 1) to -(k + 2) and that of X_j from
(k + 2) to (k + 1), where k is the number of observations between X_i and X_j.
(c) Show that the sum of negative constants c_k increases by c_1 if i = j and by
(c_{k+2} - c_{k+1}) if i ≠ j, while the sum of positive c_k's decreases by the same
amount, and the sum of signed c_k's decreases by twice this amount.
85. Show that the confidence region for the population center of symmetry μ cor-
responding to a one- or two-sided test based on a sum of signed c_k's is an interval
if c_{k+1} ≥ c_k ≥ 0 for all k.
86. Show that, if c_k = 0 for k ≤ n - m and c_k = 1 for k > n - m, then
(a) The test of Sect. 7.1 is equivalent to carrying out a sign test on the m observa-
tions largest in absolute value.
(b) The corresponding confidence limits are (X_(k+1) + X_(n-m+k+1))/2 and
(X_(n-k) + X_(m-k))/2, where k and m - k are the lower and upper critical
values for the number of negative observations in a sample of m. (Noether
[1973] suggests these confidence limits for easy calculation and studies their
efficiency.)
87. Show that, if c_k = 0 for k ≤ n - m and c_k = k + m - n for k > n - m, then
(a) The test of Sect. 7.1 is equivalent to carrying out a Wilcoxon signed-rank test
on the m observations largest in absolute value. (This fact could be exploited to
reduce tabulation.)
(b) The corresponding confidence bounds are the (t + 1)th smallest and largest
among those Walsh averages (X_(i) + X_(j))/2 with j - i ≥ n - m, where t is
the critical value of the negative rank sum in a sample of m. (Hint: Show that the
rank of |X_(i)| among the m largest in absolute value is the number of negative
X_(i) + X_(j) with j - i ≥ n - m if X_(i) < 0.)

(c) How could these confidence bounds be found graphically?


88. For order statistics X_(i) of a sample of n from a distribution which is continuous
and symmetric about 0, show that P(X_(i) + X_(j) > 0 and X_(k) + X_(l) > 0) =
P(S_1 ≤ i - 1 and S_1 + S_2 ≤ k - 1) for i ≤ k ≤ l ≤ j, where S_1 and S_2 are
independently binomial with parameters (1/2, n - j + i) and (1/2, j - l + k - i) respec-
tively. (Hint: X_(i) + X_(j) > 0 if and only if there are less than i negative observations
among the n - j + i largest in absolute value. See also Problem 86.)
*89. (a) How many exact levels are available for a test based on a sum of signed c_k's
if no two subsets of the c_k's have the same sum?
(b) How many distinct confidence limits corresponding to such a test can there be
in any given sample? How can this number be so much less than the number
in (a)?
*90. Is the test which corresponds to the confidence bound in (7.5) equivalent to a
signed-rank test?
*91. Consider the test based on the sum of the signed c_k for c_k = k + 2^(-k).
(a) Show that this test agrees with the Wilcoxon test except that it distinguishes
rank orders with the same signed-rank sum. Compare Mantel and Rahe [1980].
(b) Show that its P-value is always at least as small as the Wilcoxon P-value.
(c) Show that its P-value is always larger than the next Wilcoxon P-value. How
near can it come?
(d) When is this test more powerful than the Wilcoxon test? Equivalent? Less
powerful? What would be a fair way to compare the power of the two tests?
(e) Show that the expected one-tailed P-value (expected significance level) of
this test under the null hypothesis is 0.5(1 + 1/2^n). Show that it is 0.5(1 +
Σ p_i^2) ≥ 0.5(1 + 1/N) for a test statistic having N possible values with respec-
tive probabilities p_i under the null hypothesis, with equality holding if and only
if p_i = 1/N for i = 1, ..., N.
(f) Corresponding to this test, what are the confidence bounds for an arbitrary
center of symmetry μ? How do they compare with the Wilcoxon bounds?
(g) What changes occur if c_k is changed to k - 2^(-k) for some values of k? What if
c_k = k ± ε2^(-k), 0 < ε < 1?

92. Show that multiplying all c_k's by the same positive constant has no effect on a test
based on the sum of signed c_k's. What about a negative constant? What about
adding a constant to all c_k's?
93. Given a procedure φ, let ψ consist of applying φ after permuting the observations
randomly. Show that ψ is a permutation-invariant procedure.
*94. In a testing situation where permutation invariance is applicable (every permuta-
tion π carries null into null and alternative into alternative distributions), show that
(a) If φ is uniformly most powerful (at level α), then φ has the same power against F
and Fπ for all alternatives F and all permutations π, and there is a permutation-
invariant test which is uniformly most powerful.
(b) The statement in (a) remains true when the words "most powerful" are
replaced by "most powerful unbiased."

(c) The envelope power is the same at F and Fπ for all F and π, and if there is a
most stringent test (at level α), then there is a most stringent test which is
permutation invariant, where we use the following definitions. Let θ index the
alternative distributions, let α(θ; φ) be the power of φ against θ, let the "enve-
lope power" α*(θ) be the maximum of α(θ; φ) over all φ at level α, and let b(φ)
be the maximum of α*(θ) - α(θ; φ) over all θ (the maximum shortfall of φ);
φ* is "most stringent" if it minimizes b(φ) among tests φ at level α.
(d) If there is a uniformly most powerful invariant test, then it is a most stringent
test.
(e) What properties of the permutations as a class of transformations are signifi-
cant for these results, and how?
*95. Let S be a set of transformations of the observations which are one-to-one and onto.
Suppose that, in a testing problem, all transformations in S carry null into null
and alternative into alternative distributions.
(a) Show that the same is true of all transformations in the group G generated by S
(under composition).
(b) Let θ index the possible distributions of the observations. Show that S and G
induce sets S* and G* of transformations of θ and that G* is the group generated
by S*.
(c) Show that G is homomorphic, but not necessarily isomorphic, to G*.

*96. Show that the following classes of transformations are groups:
(a) All permutations of X_1, ..., X_n.
(b) All transformations of the form g(X_1), ..., g(X_n) where g is a strictly increasing,
odd function.
97. Let g be a strictly increasing, odd function. Show that g(X) has a symmetric
distribution if and only if X does.
98. Show that if X_1, ..., X_n and X'_1, ..., X'_n have the same signed ranks, then there
exists a strictly increasing, odd function g such that g(X_j) = X'_j, j = 1, ..., n.
*99. Show that, given any set S of transformations of observations,
(a) There exists a statistic T (possibly multivariate) whose value is the same for
two different samples if and only if one sample can be transformed into the
other by some transformation in the group generated by S.
(b) A procedure is invariant under the transformations of S if and only if it depends
only on T.
(c) Give a set of transformations for which T would be the vector of signs of the
observations.
100. (a) Show that strictly increasing, odd transformations do not generally preserve
the property of symmetry of a distribution about a value other than 0.
(b) Which transformations always do?
(c) Give a (multivariate) statistic T such that a one-sample procedure is invariant
under all transformations in (b) and under permutations if and only if it
depends only on T.

101. (a) Show that the most powerful signed-rank test against F is of the form (9.1) if
all combinations of signed ranks are equally likely under the null hypothesis.
(b) Explicitly, how is k determined and what happens when P(r_1, ..., r_n) = k?

102. Verify formula (9.2) for the probability of signed ranks r_1, ..., r_n in a sample from a
density f_θ.
103. Show that |X|_(j) in (9.4) is the jth order statistic in a sample of n from the cumulative
distribution 2F_0 - 1.

104. Show that a test of the form (9.6) is equivalent to one based on a sum of signed
constants (9.7).

105. If h is defined by (9.3), show that
(a) h(x) = x/σ^2 for the normal distribution with unknown mean θ and known
variance σ^2.
(b) h(x) = 2F_0(x) - 1 for the logistic distribution (9.11).

*106. Show that the locally most powerful signed-rank test against the Laplace family
of alternatives

F_θ(x) = e^(x-θ)/2 for x ≤ θ,    F_θ(x) = 1 - e^(-(x-θ))/2 for x > θ,

f_θ(x) = e^(-|x-θ|)/2

is equivalent to the sign test of Chap. 2.

*107. Let c_j = E[log(1 + U_j) - log(1 - U_j)] where U_j has a beta distribution with
parameters j and n - j + 1. Show that a test based on the sum of signed constants
c_j is a locally most powerful signed-rank test of θ = 0 for every Lehmann family of
alternatives F_θ = F^(1+θ) where F is the c.d.f. of a continuous distribution symmetric
about 0.

*108. Obtain the sign test by invoking the principle of invariance for a suitable class of
transformations when the observations are not assumed identically distributed.

*109. If a sequence of (n-variate) distributions F_ν converges in distribution to a distribu-
tion F as ν → ∞ and if A is a closed set, then P_F(A) ≥ lim sup P_ν(A).
(a) Use the fact stated above to show that, if a test whose "acceptance" region is
closed has level α for all distributions of a family ℱ, then it has level α for all
limits in distribution of sequences in ℱ.
(b) Suppose that U is a continuous function of the observations and is an upper
confidence bound for a parameter θ at level 1 - α for all distributions of a
family ℱ. Show that P(U ≥ θ) ≥ 1 - α for all limits in distribution of se-
quences in the subset of ℱ where the parameter value is θ.
(c) Show that the (joint) distribution of a sample from a discontinuous, sym-
metric distribution is the limit in distribution of distributions of samples from
continuous distributions symmetric about the same point. (Hint: Show it for
sample size 1 and then use the fact that if F_{iν} converges in distribution to F_i,
then the joint distribution of independent observations from F_{iν}, i = 1, ..., n,
converges to that for F_i, i = 1, ..., n, as ν → ∞.)
(d) Show that if an upper confidence limit U for the center of symmetry μ of a
population has level 1 - α for all continuous symmetric distributions and is a
continuous function of the observations, then P(U ≥ μ) ≥ 1 - α for all
symmetric distributions. (A similar result holds for lower confidence bounds
and for confidence intervals. Thus closed confidence regions are conservative
for discrete distributions.)
(e) Show that the confidence bounds corresponding to the Wilcoxon signed-rank
test satisfy the hypotheses of (d).
(f) Show that the statement in (e) also holds for tests based on sums of signed
constants c_k with 0 ≤ c_1 ≤ c_2 ≤ ... ≤ c_n. (See Problem 85.)
(g) Show that, in the hypothesis of (d), continuous distributions can be replaced by
distributions having densities.
CHAPTER 4
One-Sample and Paired-Sample
Inferences Based on the
Method of Randomization

1 Introduction
The signed-rank tests of the previous chapter rely on the fact that all assign-
ments of signs to the ranks of the absolute values of independent observations
are equally likely under the null hypothesis of distributions symmetric
about zero, or about some arbitrary point which is subtracted from each
observation before ranking. The same fact is true also for the observations
themselves, not merely for their ranks, and the idea underlying these tests
can be applied to any function of the observations, not merely to a function
of the signed ranks.
More specifically, consider any function of sets of sample observations,
and the possible values of this function under all assignments of signs to a
given set of observations, or equivalently, to their absolute values. Dividing
the frequency of each possible value of the function by the total number of
possible assignments of signs to the given observations generates a frequency
distribution called the randomization distribution. In general, this distribu-
tion depends on the given set of observations through their absolute values.
As its name indicates, the randomization distribution is derived merely
from the conditions that, given the absolute values of the n observations,
all 2^n possible randomizations of the signs, each either + or -, attached to
the n absolute values, are equally likely to occur. As discussed in the previous
chapter, this condition holds under the null hypothesis of symmetry about
zero. In the paired-sample case it follows from the physical act of randomiza-
tion under the null hypothesis of no treatment effect whatever.
There are many interesting tests based on such a randomization distribu-
tion. Any statistic could be used as the test statistic, and any such test is
called a randomization test. If the value of the test statistic is determined by


the signed ranks, as for the sign test or the Wilcoxon signed-rank test, the
test is a signed-rank test as defined in Chap. 3. A signed-rank test may also
be called a rank-randomization test, for contrast with more general ran-
domization tests. Similarly, an arbitrary randomization test may be called
an observation-randomization test to indicate that the value of the test
statistic is determined by the signed observations, as opposed to the signed
ranks. (The terms randomization test, rank-randomization test and observa-
tion-randomization test generalize to the case of more than one sample, as
we will see in Chap. 6.) The randomization distribution of a rank-randomiza-
tion (signed-rank) test statistic does not depend on the particular set of
observations obtained as long as their absolute values are all different.
For an observation-randomization test, however, the randomization dis-
tribution of the test statistic does depend in general on the observations
obtained, specifically on their absolute values. Since the randomization
distribution treats the absolute values as given and depends only on them,
observation-randomization tests are conditional tests, conditional on the
absolute values observed.
The principle of randomization tests is usually attributed to R. A. Fisher;
it is discussed in both of Fisher's first editions [1970, first edition 1925; and
1966, first edition 1935]. Many non parametric tests are based on this prin-
ciple, as it is easily applied to a wide variety of problems. The randomization
may be a designation of sign, or sample label, or an actual rearrangement of
symbols or numbers. The test criterion is frequently a classical or parametric
test statistic applicable for the same situation, or some monotonic function
thereof which is equivalent for the purpose but simplifies calculations. In all
situations, the randomization distribution derives from the condition that all
possible outcomes of the randomization are equally likely.
Since many randomization distributions are generated by permutations,
randomization tests are frequently called permutation tests. This name will
not be used here, however, because an interpretation of "permutation"
which is broad enough to include designating signs is not natural, and we
have already used this term in discussing "permutation invariant" tests (as
in Sect. 8, Chap. 3). Conditional tests is another possible name. However,
this is insufficiently specific since many statistical tests are conditional tests,
but with different conditioning than here. Another term which appears in
the literature is Pitman tests, since Pitman [1937a, 1937b, 1938] studied
them extensively. The term randomization test could lead to confusion with
randomized tests (Sect. 5, Chap. 1), which are entirely unrelated, but some
nomenclature must be adopted and none is perfect.
In this chapter we will first discuss the one-sample randomization test
based on the sample mean, and the corresponding confidence procedure.
Then we will define the general class of randomization tests for the one-
sample problem, study some properties of these tests, and obtain most
powerful randomization tests. Two-sample observation-randomization tests
will be covered in Chap. 6.

While the presentation here is limited to the one-sample case, all the pro-
cedures are equally applicable to treatment-control differences and in
general to paired-sample observations when the differences of the pairs
are used as a single sample of observations. The hypotheses and any assump-
tions then refer to these differences and their distributions, as do the pro-
perties of the statistical procedures based on them. (See Sect. 7, Chap. 2.) It
is not necessary that a paired-sample randomization test be based on only
these differences, but other tests are seldom needed and will not be discussed
in this book.

2 Randomization Procedures Based on the Sample Mean and Equivalent Criteria
2.1 Tests

Given a sample of n independent, identically distributed observations
X_1, ..., X_n, suppose we wish to test the null hypothesis that their common
distribution is symmetric about zero. Rejection of this null hypothesis
implies either that the distribution is not symmetric, or that it is symmetric
about some point other than zero. If the symmetry part of the null hypothesis
can reasonably be assumed as part of the model, this null hypothesis reduces
to a location hypothesis, as H_0: μ = 0. This is a natural assumption for the
test we will study here. The alternative then also concerns the value of μ,
and may be either one-sided or two-sided.
Consider the particular sample obtained as one member of the family
of all possible samples having the same absolute values |X_1|, ..., |X_n| but
not necessarily the same algebraic signs. Under the null hypothesis, a priori,
the sign of X_j is as likely to be positive as negative, and each of the possible
sets of assignments of signs to all the absolute values is equally likely to
arise. Since there are 2^n ways to assign plus and minus signs to the absolute
values, there are 2^n members of the family of possible samples, which can
be enumerated, and each has probability 1/2^n.
Once some test criterion is selected, the value of the statistic can be cal-
culated for each member of the family. The null distribution of this statistic,
conditional on the absolute values of the observations, is then easily found.
This is the randomization distribution of the test statistic. From this dis-
tribution, a P-value can be found and a test performed in the usual way.
A natural criterion to use in this one-sample problem is the sample mean.
The calculations for generating the randomization distribution of the mean
are lengthy unless n is small (or only an extreme tail or P-value is found).
However, any statistic which is, for given absolute values, a monotonic func-
tion of the sample mean provides an equivalent randomization test because
the values of the function occur in the same order. Some specific statistics

which are somewhat easier to use than the sample mean X̄ but are equivalent
for the purpose of a randomization test (Problem 1) are the sum of all the
sample observations S = Σ_j X_j, the sum S+ of the positive observations,
and the sum S- of the negative observations with sign reversed (so that
S- ≥ 0). Student's t statistic, calculated in the ordinary way, is also equiva-
lent, as will be shown below.
The method of generating the randomization distribution and the
equivalence of these test statistics is illustrated in Table 2.1. In practice, of

Table 2.1ᵃ

Sample Values: 0.3, -0.8, 0.4, 0.6, -0.2, 1.0, 0.9, 5.8, 2.1, 6.1

X̄ = 1.62    S = 16.2    S+ = 17.2    S- = 1.0

Sample Absolute Values
0.2  0.3  0.4  0.6  0.8  0.9  1.0  2.1  5.8  6.1      S     X̄     S+    S-

 +    +    +    +    +    +    +    +    +    +      18.2  1.82  18.2  0
 -    +    +    +    +    +    +    +    +    +      17.8  1.78  18.0  0.2
 +    -    +    +    +    +    +    +    +    +      17.6  1.76  17.9  0.3
 +    +    -    +    +    +    +    +    +    +      17.4  1.74  17.8  0.4
 -    -    +    +    +    +    +    +    +    +      17.2  1.72  17.7  0.5
 -    +    -    +    +    +    +    +    +    +      17.0  1.70  17.6  0.6
 +    +    +    -    +    +    +    +    +    +      17.0  1.70  17.6  0.6
 +    -    -    +    +    +    +    +    +    +      16.8  1.68  17.5  0.7
 -    +    +    -    +    +    +    +    +    +      16.6  1.66  17.4  0.8
 +    +    +    +    -    +    +    +    +    +      16.6  1.66  17.4  0.8
 -    -    -    +    +    +    +    +    +    +      16.4  1.64  17.3  0.9
 +    -    +    -    +    +    +    +    +    +      16.4  1.64  17.3  0.9
 +    +    +    +    +    -    +    +    +    +      16.4  1.64  17.3  0.9
 +    +    -    -    +    +    +    +    +    +      16.2  1.62  17.2  1.0
 -    +    +    +    -    +    +    +    +    +      16.2  1.62  17.2  1.0
 +    +    +    +    +    +    -    +    +    +      16.2  1.62  17.2  1.0
 +    -    +    +    -    +    +    +    +    +      16.0  1.60  17.1  1.1

a These data are from Manis, Melvin [1955], Social interaction and the self concept,
Journal of Abnormal and Social Psychology, 51, 362-370. The X_j are differences
X_j = X'_j - X''_j, where X'_j is the decrease in the "distance" between a subject's self-
concept and a friend's impression of that subject after a certain period of time, and
X''_j is the corresponding decrease for a subject and a nonfriend; the subject-friend
pair was matched with the subject-nonfriend pair according to the value of their
"distance" at the beginning of the time period. Since the nonfriends were roommates
assigned randomly to the subjects, the subjects were expected to have the same
amount of contact with nonfriends as with friends during the time period. Manis'
hypothesis II was that over a given period of time, there will be a greater increase in
agreement between an individual's self-concept and his friend's impression of him
than there will be between an individual's self-concept and his nonfriend's impres-
sion. This hypothesis is supported if the null hypothesis that the X_j are symmetric
about 0 is rejected in favor of a positive alternative. Manis used the Wilcoxon
signed-rank test, which gives a one-tailed P-value of 0.0137 (Problem 2).

course, the values of only one statistic would be calculated; we give details
for several only to make their relationship more intuitive. The first step is to
list the absolute values of the observations to which signs are to be attached.
It is generally easier to predict which assignments lead to the extreme values
if this listing is in order of absolute magnitude (increasing or decreasing).
While there are 2^10 = 1024 different assignments of signs, Table 2.1 enumer-
ates only those 17 cases which lead to the largest values of X̄, S, or S+, and
the smallest values of S-. S- = 1.0 was observed for these data and Table
2.1 shows that only 16 of the 1024 cases give S- that small; hence the one-
tailed P-value from the randomization test is 16/1024 = 0.0156.
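
For readers who want to reproduce this count by machine, the following minimal Python sketch (not part of the original text; variable names are illustrative) enumerates all 2^10 sign assignments for the absolute values in Table 2.1 and counts those with S- no larger than the observed S-; working in integer tenths avoids floating-point comparison problems.

```python
from itertools import product

diffs = [0.3, -0.8, 0.4, 0.6, -0.2, 1.0, 0.9, 5.8, 2.1, 6.1]   # data of Table 2.1
tenths = [round(abs(d) * 10) for d in diffs]                   # absolute values in tenths
observed = sum(round(-d * 10) for d in diffs if d < 0)         # observed S- (= 10 tenths)

extreme = 0
for signs in product((1, -1), repeat=len(tenths)):             # all 2^n sign assignments
    s_minus = sum(a for s, a in zip(signs, tenths) if s < 0)   # S- for this assignment
    if s_minus <= observed:
        extreme += 1

print(extreme, extreme / 2 ** len(tenths))                     # 16 0.015625
```

Only the 16 qualifying assignments actually need to be found, of course; full enumeration is shown here only because n = 10 makes it cheap.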
The calculation is most easily performed in terms of S- when X̄ is "large"
as here, and in terms of S+ when X̄ is small. If the same value of S+ or S-
occurs more than once, each occurrence must be counted separately. Al-
though calculating the complete randomization distribution straightfor-
wardly would require enumerating 2^n possibilities, this is never necessary
for any randomization test. In order to find P (the P-value), the enumeration
must include only those 2^n P assignments which lead to values as extreme as
that observed in the appropriate direction (that is, less than or equal to an
observed S- or greater than or equal to an observed S+, when S- ≤ S+).
Furthermore, if a nonrandomized test is desired at level α, a decision to
reject (P ≤ α) can be reached by enumerating these 2^n P cases, and a decision
to "accept" by identifying any 2^n α cases as extreme as that observed. A
test decision by direct counting therefore requires enumeration of only
2^n α or 2^n P cases as extreme as that observed, whichever is smaller. To com-
pute the P-value, of course, it is necessary to enumerate every point in the
relevant tail, and it is difficult to select them in the correct order for enumera-
tion except in the very extreme end of the relevant tail. Considerable care is
required if the entire distribution is not calculated. A systematic approach is
to enumerate according to the number of + or - signs (starting with 0).
Clever tricks, such as the "branch and bound" method of mathematical
programming, might reduce the work entailed. Even to calculate the entire
distribution, it is sufficient to enumerate 2^n/2 = 2^(n-1) assignments, since the
randomization distributions of S+, S-, X̄, S and t are all symmetric (Problem
4). This also means that the natural two-tailed test is equal-tailed. Ap-
proximations for use when exact calculation is too difficult will be discussed
in Sect. 2.5.
If the null hypothesis is generalized to state that the observations are
symmetrically distributed about some point μ_0, or the symmetry is assumed
and the null hypothesis is H_0: μ = μ_0, the same procedure can be used but
applied to the observations X_j - μ_0. That is, the randomization distribution
is generated by assigning signs to the |X_j - μ_0|.
Since, given the |X_j|, the statistics S, X̄, S+, and S- are all linearly related,
they provide equivalent randomization test criteria. Although the ordinary
t statistic is not linearly related to these other statistics, it also provides an
equivalent randomization test criterion, as we now show.

*Writing the sample variance as

Σ_i (X_i - X̄)^2/n = (Σ_i X_i^2/n) - (Σ_i X_i)^2/n^2 = (n Σ_i X_i^2 - S^2)/n^2,

we can write the ordinary t statistic as

t = (n - 1)^(1/2) S/(n Σ_i X_i^2 - S^2)^(1/2) = ±[(n - 1)/((n Σ_i X_i^2/S^2) - 1)]^(1/2),   (2.1)

where the sign is the same as the sign of S. For given absolute values |X_1|,
..., |X_n|, the sum Σ_i X_i^2 in the denominator is a constant, and hence t is a
monotone increasing function of S and therefore equivalent as a test
statistic.*
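
A quick numerical illustration of this monotonicity (again a sketch, not from the text) computes t from S and the fixed sum of squares exactly as in (2.1), using the Table 2.1 data:

```python
from math import sqrt

x = [0.3, -0.8, 0.4, 0.6, -0.2, 1.0, 0.9, 5.8, 2.1, 6.1]
n, sum_sq = len(x), sum(v * v for v in x)      # sum of squares is fixed by the |X_j|

def t_from_s(s):
    # ordinary one-sample t statistic written as a function of S alone, as in (2.1)
    return sqrt(n - 1) * s / sqrt(n * sum_sq - s * s)

print(round(t_from_s(sum(x)), 3))              # about 2.13 for the observed sample
print(t_from_s(16.2) > t_from_s(16.0))         # True: t increases with S
```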
The fact that a randomization test based on the usual t statistic is valid
for any symmetric null distribution does not imply either that t has Student's t
distribution or that the usual t test is valid without the assumption of nor-
mality. The normal theory test compares the observed value of the t statistic
with a critical value from Student's t distribution (which is fixed, depending
only on the sample size), while the randomization test compares it with a
critical value obtained from the randomization distribution (which varies
from sample to sample even for a given sample size). Accordingly, even
though the test statistics are equal, the test procedures are not equivalent
because a P-value or critical value based on the randomization distribution
does not in general agree with one based on Student's t distnbution.
In addition, the large-sample theory of the randomization test does not
point particularly to the Student's t test. To order 1/Jn, the accuracy of
usual large-sample theory, Student's t and the normal distribution are in-
distinguishable. Any asymptotic calculation fine enough to distinguish them
will show that the randomization distribution of the t statistic, Student's t
distribution, and the normal distribution all differ by terms of the same
order, namely lin. Specifically, it is well-known (and easily seen by inspecting
a table) that Student's t distribution differs from the normal distribution
by terms of order lin, while the approximations developed in Sect. 2.5 show
that the randomization distribution of the t statistic differs from Student's t
distribution by terms of the same order (smaller order if the popUlation has
normal kurtosis).
In short, the fact that we can perform a randomization test based on the
t statistic does not imply that Student's t distribution provides either a valid
test in small samples or asymptotically better validity than other distributions
approaching normality at the rate lin. For numerical results, see Sect. 2.5(c)
and the references cited there.

2.2 Weakening the Assumptions

We have been assuming that the observations are independent and identically
distributed with a symmetric common distribution under H_0. If the assump-
tions are relaxed so that the X_j are not identically distributed but are in-
dependent and symmetrically distributed about 0 (or about the hypothesized
value μ_0), then the level of the test is not affected (Problem 5a). (The level
of the corresponding confidence procedure to be discussed in Sect. 2.3 is
preserved if the X_j are independent and symmetrically distributed about a
common point μ.) The test is also valid for a null hypothesis of no treatment
effect if the X_j are the treatment-control differences in a matched-pairs
experiment with controls selected randomly from the pairs (Problem 6).
If the X_j are continuously distributed, then the tests (and corresponding
confidence bounds) have level α = k/2^n for some chosen integer k; if they are
discontinuously distributed, then the procedures as described are conserva-
tive (Problems 5 and 8b; Problem 109 of Chap. 3).
These relaxations of the assumptions and others described in Sect. 3.5,
Chap. 3 are possible here for the same reasons as given there.

2.3 Related Confidence Procedures

Under the symmetry assumption, the randomization test procedure can
also be used to construct a confidence region for the value of μ, the center
of symmetry. Unfortunately, the randomization distribution is different
when different values of μ are subtracted from the observations. The con-
fidence region is an interval, and its endpoints could be obtained by trial
and error. (That is, successive values of μ are subtracted, larger or smaller
than previous values as appropriate, until the value of S+ or S- equals the
appropriate level α upper or lower critical value of a randomization test
for center of symmetry 0.) The endpoints of the normal theory confidence
interval for μ at the same level could be used as initial trial values.
However, as in the case of the Wilcoxon signed-rank sum procedure, a
more systematic and convenient method is available for finding exact con-
fidence limits; this was suggested by Tukey but developed in Hartigan
[1969]. For a sample of n observations, there are 2^n - 1 different subsamples
(of all sizes 1, 2, ..., n, but excluding the null subsample). Consider the
subsample means in order of algebraic (not absolute) magnitude. It can be
shown that the kth smallest and kth largest subsample means are the lower
and upper confidence bounds respectively, each at level 1 - α, that corres-
pond to the one-sample randomization test based on X̄ (or any other
equivalent randomization test criterion), where the one-tailed level is
α = k/2^n, so k = 2^n α (Problem 8). If it is infeasible or impractical to find the
kth smallest or largest among all the subsample means, one can find the
kth among a smaller number of subsample means, say m, selected either at
random without replacement (Problem 9) or in a "balanced" manner
determined by group theory (as explained by Hartigan [1969]); then
α = k/(m + 1). Similar methods for randomization tests are discussed in
Sect. 2.5 below.
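
For moderate n the subsample means can simply be listed exhaustively. The following sketch (Python; the function name and the brute-force approach are illustrative only, and are not Hartigan's balanced scheme) returns the kth smallest and kth largest subsample means, which by the result just quoted are one-sided confidence bounds at level 1 - k/2^n.

```python
from itertools import combinations

def subsample_mean_bounds(x, k):
    # kth smallest and kth largest of the 2^n - 1 nonempty subsample means;
    # each is a one-sided confidence bound for the center of symmetry at
    # one-tailed level alpha = k / 2^n.
    means = sorted(sum(s) / len(s)
                   for r in range(1, len(x) + 1)
                   for s in combinations(x, r))
    return means[k - 1], means[-k]

# e.g., bounds corresponding to the one-tailed level 16/1024 for the Table 2.1 data:
lower, upper = subsample_mean_bounds(
    [0.3, -0.8, 0.4, 0.6, -0.2, 1.0, 0.9, 5.8, 2.1, 6.1], k=16)
```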


As explained in Sect. 2.2, these confidence procedures are also valid under
certain relaxations of the assumptions.

2.4 Properties of the Exact Randomization Distribution

As mentioned in Sect. 2.1, the statistics S, X, S+, S- and t all provide equivalent
randomization test criteria and all have symmetric distributions, although only
S+ and S- are identically distributed.
   The means and variances of these statistics under the randomization distribution
are easily calculated. These moments are of course conditional on the absolute
values of the observations. Since they are considered constants, we denote the
values |X_1|, ..., |X_n| by a_1, ..., a_n. Then for S = Σ_j X_j (Problem 11), for
instance, we have

   E(S) = 0                                                        (2.2)
   var(S) = Σ_j a_j² = σ².                                         (2.3)

Thus the center of symmetry for S is zero, as it is for X and t, but not for S+ or
S-. Note that σ² is defined by (2.3) and is the variance of the randomization
distribution of S, not the population variance of the X_j, although it is an unbiased
estimator of n times the latter under the null hypothesis.
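As a quick check of (2.2) and (2.3) (our own verification, not part of the original
exposition): under the randomization distribution S = Σ_j J_j a_j, where the J_j are
independent signs taking the values ±1 with probability 1/2 each, so E(J_j) = 0 and
var(J_j) = 1; hence, in LaTeX notation,

    E(S) = \sum_j a_j \, E(J_j) = 0, \qquad
    \operatorname{var}(S) = \sum_j a_j^2 \, \operatorname{var}(J_j)
                          = \sum_j a_j^2 = \sigma^2 .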

2.5 Modifications and Approximations to the Exact Randomization Procedure

The exact randomization distribution can always be enumerated by computer.
Complete enumeration is not necessarily required to find tails of the distribution,
or individual tail probabilities. Mathematical programming methods, such as
branch and bound, can be used. However, there may be situations where a
randomization test is desired but it is infeasible or impractical to carry it out
exactly. One might then use (a) simulation ("Monte Carlo" sampling) of the
complete randomization set, (b) a deterministically restricted randomization set,
or (c) the normal or some other approximation.
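Before turning to these alternatives, it may help to see what complete enumeration
involves. The sketch below (Python; the function name and the use of the criterion
S = Σ X_j are our own illustrative choices) computes the exact one-tailed P-value by
generating all 2^n assignments of signs to the absolute values.

    from itertools import product

    def exact_randomization_pvalue(x):
        """Exact upper-tailed P-value of the randomization test based on
        S = sum(x): the proportion of the 2**n sign assignments whose sum
        is >= the observed sum."""
        s_obs = sum(x)
        a = [abs(v) for v in x]
        count = 0
        total = 0
        for signs in product((1, -1), repeat=len(a)):   # all 2**n assignments
            total += 1
            if sum(s * v for s, v in zip(signs, a)) >= s_obs:
                count += 1
        return count / total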
(a) Simulation (Sampling the Randomization Set). One method of reducing
the size of the computational problem is to use only some of the absolute
values, selected either at random or on some systematic basis that does not
depend on the observed signs. This appears unduly wasteful, and it is pre-
sumably more efficient to use a subset of the complete randomization set
without discarding observations.
The most direct approach would be to sample from the randomization
distribution, with or without replacement. For example, we could read n
random digits from a table or generate them by computer and assign a + sign
to |X_j| if the jth random digit is even, a - sign otherwise; this would provide
one observation from the randomization distribution. Having obtained the
assignment of signs, it is a simple matter to compute the chosen test criterion,
such as S here. The process of assigning signs at random and computing
the test criterion can be repeated until a large number of values have been
obtained. The relative frequencies of these values comprise the "simulated"
randomization distribution.
The usual simulation method would be to sample as above with replacement and
to estimate the P-value of a randomization test by the corresponding relative
frequency in the simulated distribution. In the present situation, instead of
considering this to be an estimate of P, one could redefine the test as one that
rejects if the relative frequency of equally or more extreme values in the simulated
randomization distribution is α or less. This relative frequency is the P-value of
the redefined test, and the redefined test has level α. This holds whether the
sampling is done with or without replacement, but to be precise the simulated
samples must be augmented by the actual sample in computing the relative
frequencies (Problem 9). Thus it is possible to obtain the desired significance level
and a precise P-value without carrying out a large number of simulation trials, as
would be required to estimate a small probability accurately. Of course, this
procedure entails some loss of power; the fewer the trials carried out, the greater
the loss. Dwass [1957] gives an indication of the size of the loss. Valid confidence
bounds can be obtained similarly by sampling the subsample means (Problem 9).
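A minimal sketch of this redefined test (Python; unrestricted sampling with replacement,
case (i) of Problem 9; the function name is ours) follows. Note that the observed sample
itself is counted among the m + 1 values, which is what makes the stated level exact
rather than approximate.

    import random

    def monte_carlo_randomization_pvalue(x, m, seed=0):
        """P-value of the redefined test: (1 + number of simulated sums >= observed
        sum) / (m + 1), where each simulated sum assigns random signs to the |x_j|."""
        rng = random.Random(seed)
        a = [abs(v) for v in x]
        s_obs = sum(x)
        count = 1                                   # the actual sample is included
        for _ in range(m):
            s = sum(v if rng.random() < 0.5 else -v for v in a)
            if s >= s_obs:
                count += 1
        return count / (m + 1)

    # Rejecting whenever this P-value is <= alpha gives a test of level alpha.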
More sophisticated methods of simulation, i.e., of sampling the randomiza-
tion distribution, might produce much more accurate estimates with a given
number of trials. Such methods would be expected to entail a smaller loss of
power, and at least some of them would be expected to provide a precise
level and P-value when the simulation process is regarded as part of the
definition of the test, not merely as an approximation procedure. This will
not be investigated further here, however.
(b) A Deterministically Restricted Randomization Set. Instead of considering all
2^n possible reassignments of signs to X_1, ..., X_n, one might restrict consideration
to a subset. If the subset has the special property of being a group (a subgroup of
the group of all reassignments of signs), then it can be used to define a valid test
in the same way as the whole randomization set. Problem 12 illustrates the group
property and Problem 13 illustrates the fact that the equally likely property may
fail in the absence of the group property.
   More specifically, suppose J = (J_1, ..., J_n) stands for any vector of 1's and
-1's. Let G be a set of such vectors with the property that, if J and J' belong to G,
then JJ' = (J_1J'_1, ..., J_nJ'_n) also belongs to G. (In particular, J² = (1, ..., 1),
the identity for this form of multiplication, belongs to G, and every J is its own
inverse.) Consider the set G_X of values (J_1X_1, ..., J_nX_n) for all J in G (the
"orbit" of (X_1, ..., X_n)). Under the randomization distribution, given the set G_X
to which X belongs, all members of G_X are equally likely. Thus the value of the
criterion at X can be compared with its values for all members of G_X just as
before. The null hypothesis is rejected if the value at X is among the k smallest
(or largest) of these values, and the one-tailed level is k divided by the number of
members of G. The power, of course, depends on the choice of G.
The confidence bounds for the center of symmetry μ corresponding to the
randomization test based on X with the randomization set restricted as described
above can be obtained as follows (Problem 14). For each J in G, find the subsample
mean of those X_j with J_j = -1. The kth smallest and kth largest of these
subsample means are the confidence bounds.
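The following sketch (Python; the particular generators and the function names are our own
illustration) builds a subgroup G from a few generating sign vectors, evaluates S = Σ X_j
over the orbit G_X for the restricted randomization test, and also collects the subsample
means whose kth smallest and kth largest give the corresponding confidence bounds.

    from itertools import product as iproduct

    def subgroup_from_generators(gens):
        """All elementwise products of subsets of the generating sign vectors
        (including the identity (1, ..., 1)); such a set is closed under the
        multiplication JJ' and therefore forms a group."""
        n = len(gens[0])
        members = set()
        for bits in iproduct((0, 1), repeat=len(gens)):
            j = [1] * n
            for use, g in zip(bits, gens):
                if use:
                    j = [a * b for a, b in zip(j, g)]
            members.add(tuple(j))
        return sorted(members)

    def restricted_randomization(x, gens):
        """Values of S over the orbit G_X, and the subsample means (over the
        X_j with J_j = -1) used for the confidence bounds."""
        G = subgroup_from_generators(gens)
        sums = [sum(j * v for j, v in zip(J, x)) for J in G]
        sub_means = []
        for J in G:
            sub = [v for j, v in zip(J, x) if j == -1]
            if sub:
                sub_means.append(sum(sub) / len(sub))
        return sums, sorted(sub_means)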
(c) Approximations to the Randomization Distribution. We now discuss some
approximations to the randomization distribution that are based on tabled
probability distributions. We consider first the ordinary normal approximation
based on the standardized form of the random variable S, which we denote here
by Z for convenience. The mean and variance of S given in (2.2) and (2.3) lead to

   Z = S/σ = S/√(Σ_j a_j²).                                        (2.4)

When the randomization distribution of Z in (2.4) is approximated by the standard
normal distribution, the results of Zahl [1962] show that the error in the
approximated tail probability is at most

   Ψ_3 C(Ψ_3, Ψ_4, Z) e^{-Z²/2}                                    (2.5)

where

   Ψ_r = Σ_j a_j^r / σ^r                                           (2.6)

for r = 3 and r = 4 and C in (2.5) is a function which is generally between 0.5 and
2.0. The final factor e^{-Z²/2} in (2.5) is less than 0.2 for |Z| ≥ 1.8 and less than
0.1 for |Z| ≥ 2.15. Note that Ψ_3 is of order n^{-1/2} (in fact, about 2.3n^{-1/2}
for a typical normal sample), and Ψ_4 is of order n^{-1}. Moses [1952] gives the
rule of thumb that this normal approximation should not be used unless
max_j a_j² / Σ_j a_j² ≤ 5/(2n).
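A sketch of how this approximation is applied in practice (Python; the function name is
ours) standardizes the observed sum by σ = √(Σ a_j²), checks Moses' rule of thumb, and
refers Z to the standard normal distribution.

    from math import erf, sqrt

    def normal_approx_randomization(x):
        """Upper-tailed normal approximation to the randomization test based on
        S = sum(x), together with Moses' rule-of-thumb check."""
        a2 = [v * v for v in x]
        sigma = sqrt(sum(a2))
        z = sum(x) / sigma                           # Z of (2.4)
        rule_ok = max(a2) / sum(a2) <= 5 / (2 * len(x))
        p_upper = 0.5 * (1 - erf(z / sqrt(2)))       # P(N(0,1) >= z)
        return z, p_upper, rule_ok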
Although this normal approximation procedure appears to reflect properties
specific to the randomization distribution while the ordinary t procedure does not,
the following argument shows that this is an illusion and suggests that the latter
procedure may actually be better than the former procedure as an approximation
to the randomization procedure. We showed in Sect. 2.1 that the randomization
test based on the ordinary t statistic is equivalent to one based on S. If we write
the ordinary t statistic in (2.1) in the form

   t = Z [(n - 1)/(n - Z²)]^{1/2},                                 (2.7)

it is clear that comparing Z with a percent point z_α of the standard normal
distribution is equivalent to comparing t with

   z_α [(n - 1)/(n - z_α²)]^{1/2}.                                 (2.8)
As a result, the normal approximation to the randomization test merely replaces
the critical value t_α of the ordinary t test by (2.8), which is, like t_α, a constant
that does not depend on the sample observations. (The two critical values, t_α and
the quantity in (2.8), differ considerably in small samples except for α near 0.05,
although both approach z_α as n → ∞. See Problem 15.) If the randomization test
is to be performed by comparing the t statistic (2.1) or (2.7) to some constant
value, then under normality, t_α will be a better constant on the average than (2.8).
This suggests that t_α will be better in general, at least for samples that are not too
far from normal. Another bad feature of (2.8) is that it becomes infinite before α
reaches zero, namely for z_α = √n.
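The comparison of the two critical values is easy to tabulate; the short sketch below
(Python with SciPy, our own illustration along the lines of Problem 15a) prints t_α and
the quantity (2.8) for a few sample sizes.

    from math import sqrt
    from scipy.stats import norm, t

    def compare_critical_values(alpha, ns=(3, 6, 10, 20)):
        """Tabulate the ordinary t critical value and the value (2.8) implied
        by the normal approximation to the randomization test."""
        z = norm.ppf(1 - alpha)                     # upper-alpha normal point
        rows = []
        for n in ns:
            t_crit = t.ppf(1 - alpha, n - 1)
            approx = z * sqrt((n - 1) / (n - z * z)) if n > z * z else float('inf')
            rows.append((n, t_crit, approx))
        return rows

    for n, tc, ap in compare_critical_values(0.05):
        print(f"n={n:2d}  t_alpha={tc:.3f}  (2.8)={ap:.3f}")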
*So far it appears that we have not improved on the ordinary t test as an
approximation to the randomization test. We now discuss some more refined
approximations intended to give an improvement. The first approximation is
motivated by the following reasoning. Recall that if t has Student's t distribution
with n - 1 degrees of freedom, then t² has the F distribution with 1 and n - 1
degrees of freedom, and hence, equivalently, t²/(t² + n - 1) = Z²/n has the beta
distribution with parameters 1/2 and (n - 1)/2. However, the moments of these
F and beta distributions differ from the corresponding moments of the
randomization distributions of t² and Z²/n except in special cases (Problem 16).
Hence, a better approximation should result if we adjust the parameters of the F
or beta distribution so that its first two moments are equal to the first two moments
of the corresponding randomization distribution. These approximations will match
the first four moments of the randomization distributions of t and Z respectively,
their first and third moments being equal to zero by symmetry.
   We show below that the adjustment required to match these moments in the
beta form (the F form is harder and not equivalent) is equivalent to multiplying
the normal theory degrees of freedom for F, namely 1 and n - 1, by a constant
correction factor d defined by

   d = [(n - 3)σ⁴ + 2 Σ_j a_j⁴] / [n(σ⁴ - Σ_j a_j⁴)]
     = [(n - 3) + 2Ψ_4] / [n(1 - Ψ_4)]                             (2.9)

where Ψ_4 is defined by (2.6) with r = 4. Hence this method of approximation is
to treat t², the squared value of (2.7), as F distributed with fractional degrees of
freedom d and (n - 1)d.
In order to get an idea of the size of d, the correction factor multiplying the
degrees of freedom, we rewrite (2.9) in the form (Problem 17a)

   d = 1 + [(b_2 - 3)/n] [1 - b_2/(n + 2)]^{-1}                    (2.10)

where

   b_2 = (n + 2) Σ_j a_j⁴ / (Σ_j a_j²)² = (n + 2)Ψ_4.              (2.11)


Under the null hypothesis, b_2 is a consistent estimator of the kurtosis of the
distribution of the X_j (Problem 17b); the kurtosis parameter equals 3 for normal
distributions. Hence, from (2.10) we see that d will differ from 1 by order 1/n,
with the amount of correction depending on the departure of the kurtosis from
that of the normal distribution.
The small fractional number of degrees of freedom d in the numerator of the
foregoing F distribution may sometimes preclude its use in applications. A further
approximation would be to replace this F distribution by a scaled F distribution
with 1 and k degrees of freedom, where the scale factor and k are chosen to give
the same first two moments. This leads (see below and Problem 18d) to treating
c times the t statistic as Student's t distributed with k degrees of freedom, where

   k = [nd(4 - d) + d² + 4d - 8] / [nd(1 - d) + d² + 4d - 2],      (2.12)

   c = {[(k - 2)/k] [d(n - 1)/(d(n - 1) - 2)]}^{1/2}.              (2.13)
In order to get an idea of the size of the correction factor here, we write d in (2.10)
in the form

   d = 1 + γ/n                                                     (2.14)

where γ is a measure of the deviation from normal kurtosis. Substituting (2.14)
for d in (2.12) and (2.13), we obtain after some algebra

   k = n - 1 + [γ/(3 - γ)] [n - 5 - 3γ/(3 - γ)] + terms of order (1/n),

   c = 1 + γ/(3n) + terms of order (1/n²).

These results show clearly the order of magnitude of the corrections to the normal
theory values k = n - 1, c = 1.
We illustrate both of these approximations using the Darwin data (where n = 15)
introduced in Sect. 3.1, Chap. 3. Fisher [1966 and earlier editions] gives the
one-tailed P-value in the randomization distribution as 863/2^15 = 0.02634. Using
Student's t distribution with n - 1 = 14 degrees of freedom, Fisher obtains 0.02485
for the ordinary t test (t = 2.148), and 0.02529 with a continuity correction to
allow for the discreteness of the measurements (t = 2.139). If we use the
approximation that treats t² as F distributed with d and (n - 1)d degrees of
freedom, we first calculate d from (2.9) for these same data as d = 0.937; then the
F distribution with degrees of freedom d = 0.937 and (n - 1)d = 13.12 gives the
one-tailed P-value as 0.02643 without a continuity correction, and 0.02686 with
one. If we use the method
of approximation by scaled F (or t) for these same data, we calculate by (2.12)
and (2.13) k = 11.31, c = 0.9855, and ct = 2.108 (with a continuity correction).
Treating ct as Student's t distributed with k = 11.31 degrees of freedom, we obtain
a one-tailed P-value of 0.02912. In this example ct gives a less accurate
approximation to the randomization distribution than the ordinary t, but
presumably this is just a coincidence.*
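A sketch of these calculations (Python with SciPy; the function name is ours, and no
continuity correction is attempted) follows. Given the observations it computes Ψ_4, d
from (2.9), the one-tailed P-value from the F approximation with d and (n - 1)d degrees
of freedom, and k and c from (2.12) and (2.13) with the corresponding scaled-t P-value.

    from math import sqrt
    from scipy.stats import f as f_dist, t as t_dist

    def refined_approximations(x):
        """Approximate one-tailed P-values (upper tail, assuming the observed t
        is positive) for the randomization test based on t, via the moment-matched
        F and scaled-t approximations of Sect. 2.5(c)."""
        n = len(x)
        a2 = [v * v for v in x]
        sigma2 = sum(a2)
        psi4 = sum(v * v for v in a2) / sigma2 ** 2          # Psi_4 of (2.6)
        d = (n - 3 + 2 * psi4) / (n * (1 - psi4))            # (2.9)
        z = sum(x) / sqrt(sigma2)                            # (2.4)
        tstat = z * sqrt((n - 1) / (n - z * z))              # (2.7)
        p_f = 0.5 * f_dist.sf(tstat ** 2, d, (n - 1) * d)    # half the F tail
        k = (n * d * (4 - d) + d ** 2 + 4 * d - 8) / (n * d * (1 - d) + d ** 2 + 4 * d - 2)
        c = sqrt((k - 2) / k * d * (n - 1) / (d * (n - 1) - 2))
        p_t = t_dist.sf(c * tstat, k)                        # scaled t with k df
        return d, p_f, k, c, p_t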
*PROOF. Under normal theory, t² has an F distribution with 1 and n - 1 degrees
of freedom, or, equivalently, W = t²/(t² + n - 1) has a beta distribution with
parameters 1/2 and (n - 1)/2. For the ordinary t statistic, (2.7) shows that

   W = t²/(t² + n - 1) = Z²/n,

so that under normal theory, the marginal distribution of Z²/n is beta. This
suggests approximating the randomization distribution of Z²/n by a beta
distribution with first two moments equal to the moments of the randomization
distribution of Z²/n, which are (Problem 18b)

   E(Z²/n) = 1/n,   E(Z⁴/n²) = (3 - 2Ψ_4)/n².                      (2.15)

The first and second moments of the beta distribution with parameters a and b are

   E(W) = a/(a + b),   E(W²) = a(a + 1)/[(a + b)(a + b + 1)].      (2.16)

If we equate the moments in (2.15) to the corresponding moments in (2.16) and
solve the resulting equations simultaneously for a and b, the solution is

   a = (n - 3 + 2Ψ_4)/[2n(1 - Ψ_4)],   b = (n - 1)a.               (2.17)

Thus we are led to approximate the randomization distribution of Z²/n by a beta
distribution with the parameters in (2.17). This procedure is equivalent (Problem
18c) to approximating the randomization distribution of t² by an F distribution
with degrees of freedom 2a = d and 2b = (n - 1)d, as stated above.*   □
Notice that we have given no formal statements or proofs about asymptotic
distributions or asymptotic errors of approximation. Such formal results would be
complicated by the fact that the randomization distribution is conditional upon
the absolute values of the observations, which are themselves random. It is not
enough to apply the central limit theorem and standard asymptotic methods and
conclude that, for instance, the marginal distribution of Z is asymptotically
standard normal. What we really want to know is that the conditional distribution
of Z, given |X_1|, ..., |X_n|, "approaches" the standard normal distribution in
some sense that we have not even defined. There are several natural definitions,
but under any of them, the conditional result is much stronger than the
unconditional one and does not follow from an ordinary central limit theorem
(Problem 20).

3 The General Class of One-Sample Randomization Tests

3.1 Definition

The randomization test based on the sample mean, or any of the equivalent test
criteria given in Sect. 2.1, relies for its level only on the assumption that, given
the absolute values of the observations, all assignments of signs are equally likely.
This randomization test and randomization distribution are therefore conditional
on these absolute values. In particular, this means that the level of the
randomization test is α if, under the null hypothesis, the conditional probability
of rejection given |X_1|, ..., |X_n| is at most α, and that the P-value is the
corresponding conditional probability of a value of the test statistic equal to or
more extreme than that observed.
   Generalizing, we define a one-sample randomization test for center of symmetry
equal to zero as a test which is conditional on the absolute values of the
observations, having a null distribution that is the randomization distribution
generated by assigning signs randomly to these absolute values. Tests which are
members of this general class may depend on the actual values of the X_j.
Signed-rank (or rank-randomization) tests are those which depend only on the
signed ranks of the X_j. Any signed-rank test, including the Wilcoxon signed-rank
test or any test based on a sum of signed constants (as defined in Sect. 7.1, Chap.
3), is a randomization test; however, not all randomization tests are signed-rank
tests, as the test of Sect. 2.1 based on the sample mean or equivalent criteria is not.
The class of all randomization tests is even broader than it may seem. Two other
specific examples of randomization tests are described below.
   (1) Consider the composite test defined as choosing a constant c and applying
a level α Wilcoxon signed-rank test whenever Σ_j X_j² ≥ c(Σ_j |X_j|)², and a level
α sign test otherwise. This is a randomization test because, given the |X_j|, the
conditional level of the composite test is α regardless of which test is used. Of
course, any number of component tests may be used, but in order for such a
composite test to be a randomization test, the rule for choosing which component
test to use must depend only on the |X_j|. A rule of this kind aimed at achieving
good power for a wide variety of distributions has come to be called "adaptive"
[Hogg, 1974]. The possibility was recognized at least as early as 1956, when Stein
[1956] showed that in this way, for large n, power can be attained arbitrarily close
to that of the most powerful test against every translation alternative regardless
of the shape of the distribution, as long as its density is sufficiently regular and
symmetric.
   (2) Another possibility would be to use a level α Wilcoxon signed-rank test
whenever |X_1| > |X_2| and a level α sign test otherwise. This composite test is
also a randomization test, but it is not permutation invariant; that is, the result of
the test could be altered by a rearrangement of the X_j. Permutation invariance
(see Sect. 8, Chap. 3) is not a requirement of randomization tests, although it is
generally natural and advantageous, as we will see in the next subsection.

3.2 Properties

We now turn to some of the theoretical properties of members of the class of
one-sample randomization tests for zero center of symmetry.

Critical Function and Level

Let φ(X_1, ..., X_n) denote the probability of rejection (critical function) of an
arbitrary test. We write X_j = a_j I_j, where a_j = |X_j| and

   I_j = -1 if X_j < 0,   I_j = 1 if X_j > 0.

Then we may write the critical function as

   φ(X_1, ..., X_n) = φ(a_1 I_1, ..., a_n I_n).                    (3.1)

The a_j = |X_j| are to be treated as constants. Given the |X_j| = a_j, the
conditional expected value of the critical function, under the null hypothesis H_0
of a distribution symmetric about zero, is simply the mean of its randomization
distribution, or

   E_0[φ(X_1, ..., X_n) | |X_1| = a_1, ..., |X_n| = a_n] = Σ φ(±a_1, ..., ±a_n)/2^n,
                                                                   (3.2)

where the sum is over all the 2^n possible assignments of signs. This expected
value is the conditional probability of a Type I error for the test. Accordingly, a
test φ of H_0 has conditional level α if the quantity in (3.2) is less than or equal
to α. If this holds for all a_1, ..., a_n, then the test is a randomization test. Any
such test also has level α unconditionally (Problem 21) by the usual argument
(see Sect. 6.3, Chap. 2).
   In the notation just introduced, we may say that a signed-rank (or
rank-randomization) test is one whose critical function depends on the a_j = |X_j|
only through their ranks, as well as the signs I_j.

Weakening the Assumptions

The statements made in Sect. 2.2 about weakening the assumptions apply
without change to all members of the general class of one-sample randomiza-
tion tests, as do the statements of Sect. 3.5, Chap. 3 provided the tests are
monotonic where monotonicity is obviously called for.

Justifications of Randomization Tests

Section 8, Chap. 3 presented an argument based on concepts of invariance to
justify restricting consideration to signed-rank tests. Since all signed-rank tests are
randomization tests, this same argument justifies restricting consideration to
randomization tests. This argument required a very strong assumption, however,
namely invariance under increasing, odd transformations (Sect. 8.2, Chap. 3). We
shall now discuss quite different assumptions leading to randomization tests (not
necessarily rank tests).

Observations not identically distributed. It is elementary that, as remarked above,
any test having conditional level α given the |X_j| also has level α unconditionally.
It can also be shown that, if the null hypothesis is sufficiently broad, then
conversely, any test having (unconditional) level α must also have conditional
level α given the |X_j|, and therefore must be a randomization test. Such a strong
result is not available for independent, identically distributed observations, as will
be discussed shortly. For independent, not identically distributed observations,
however, either of the following null hypotheses is sufficiently broad to imply the
conclusion (as are certain less broad hypotheses).
   H_0: The X_j are independent, each having a distribution that is symmetric
about zero.
   H_0′: H_0 holds and the distributions have densities.
Thus the statement is as follows.

Theorem 3.1. If a test has level α for either H_0 or H_0′, then the test is a
randomization test.

PROOF. In order to see this for H_0, suppose that the conditional level were greater
than α, given some set of absolute values |X_1| = a_1, ..., |X_n| = a_n. Consider
independent X_j with distributions such that P(X_j = a_j) = P(X_j = -a_j) = 1/2;
then H_0 is satisfied (H_0′ is not) but the null probability of rejection is greater
than α. This contradiction shows that the supposition is impossible, and any level
α test of this null hypothesis has conditional level α given |X_1|, ..., |X_n|. The
result for H_0′ will not be proved here. See Lehmann [1959, Sect. 5.10] or Lehmann
and Stein [1949].   □

Observations identically distributed. A similar result holds for observations which
are independent and identically distributed, provided the null hypothesis is again
sufficiently broad, and in addition the tests are unbiased against a sufficiently
broad alternative hypothesis. The proof requires that the alternative include
distributions which are arbitrarily close to each distribution under the null
hypothesis, so that, by the usual argument, the test will have to have level exactly
α under every null distribution. Furthermore the kind of randomization test
obtained, which we will now define, is technically weaker, though practically the
same.
   The randomization test defined in Sect. 3.1 is based on the 2^n assignments of
signs and does not consider permutations of the order of the absolute values of
the observations. For identically distributed observations, all such permutations
are equally likely. If we consider all possible orders of the absolute values as well
as all assignments of signs, then we obtain a randomization distribution which has
n! 2^n equally likely possibilities instead of just 2^n. We will call a test based on
this latter randomization distribution an (n! 2^n)-type randomization test and the
type discussed earlier a 2^n-type.
   Under the null hypothesis that X_1, ..., X_n are independent and identically
distributed, symmetrically about zero, a conditional test given the order statistics
of the absolute values is an (n! 2^n)-type randomization test (and conversely),
because all assignments of signs and all orderings of the absolute values are equally
likely. To be more specific, if φ is a test function and |X|_(1) ≤ ... ≤ |X|_(n) are
the order statistics of the absolute values of the observations |X_j|, then the
conditional level of φ is

   E_0[φ(X_1, ..., X_n) | |X|_(1), ..., |X|_(n)]
      = Σ φ(±|X|_(i_1), ..., ±|X|_(i_n))/2^n n!                    (3.3)

where the sum is over all 2^n n! possible assignments of both the signs and the
arrangements (permutations) i_1, ..., i_n of the integers 1, ..., n. Thus the
conditional level of φ given the order statistics of the absolute values is its level
under (n! 2^n)-type randomization. A 2^n-type randomization test is conditional
on the absolute values |X_1|, ..., |X_n| in their observed order, and an (n! 2^n)-type
need not be. Any 2^n-type randomization test, being more conditional, is also an
(n! 2^n)-type randomization test, but not conversely (Problem 22a). For permutation
invariant tests, however, the two types are equivalent, as will be discussed shortly.
We now give conditions which lead to (n! 2^n)-type randomization tests. The first
condition is that the test have level α for a sufficiently broad null hypothesis, such
as
   H_0″: The X_j are independent with a common distribution that is symmetric
about zero.
   H_0‴: H_0″ holds and the common distribution has a density.
The second condition is that the test be unbiased against a sufficiently broad
alternative hypothesis (so that every null distribution is a limit of
alternative distributions). Alternatives which specify identical, symmetric
distributions but are still sufficiently broad are
   H_1″: The X_j are independent with a common distribution that is symmetric
about some point μ ≠ 0.
   H_1‴: H_1″ holds and the common distribution has a density.
The complete statement is given as follows.

Theorem 3.2. If a test has level α for H_0″ and is unbiased against H_1″, or has
level α for H_0‴ and is unbiased against H_1‴, then it is an (n! 2^n)-type
randomization test.

The proof of this property will be given shortly. As will be evident from this
proof, other, less broad, null and alternative hypotheses would also imply the
conclusion of this theorem.
   The result in Theorem 3.2 "justifies" restricting consideration to (n! 2^n)-type
randomization tests when the observations are identically distributed, but we have
not yet justified the further restriction to 2^n-type randomization tests. Recall,
however, that in Sect. 8, Chap. 3, we defined a procedure as permutation invariant
if it is unchanged by permutations of the observations. If a test is permutation
invariant, then averaging over the permutations has no effect, and hence the two
types of randomization are equivalent. As a result, a permutation-invariant test is
a 2^n-type randomization test if and only if it is an (n! 2^n)-type randomization
test (Problem 22b). The reasons given in Sect. 8, Chap. 3 for using a test which
does not depend on the order of the observations apply equally here. In particular,
for independent, identically distributed observations, nothing whatever is gained
by taking the order into account. Therefore, any test of H_0″ against H_1″, or H_0‴
against H_1‴, may as well be taken as permutation invariant, and if it is unbiased,
it must be a randomization test (2^n-type, or equivalently, (n! 2^n)-type).

PROOF OF THEOREM 3.2. We outline here a proof of Theorem 3.2 for continuous
distributions only. A proof for the discrete case is requested in Problem 23. Suppose
we have a test of H_0‴ which has level α under every common continuous
distribution that is symmetric about 0 and is unbiased against the alternative H_1‴
of a common continuous distribution that is symmetric about some other point.
In order to prove that this test must have conditional level α given |X|_(1), ...,
|X|_(n), we need prove only that the |X|_(1), ..., |X|_(n) are sufficient and
boundedly complete for the common boundary of H_0‴ and H_1‴, which is H_0‴
(by Theorem 6.1 of Chap. 2). The sufficiency part is trivial, and will be left as an
exercise (Problem 25). To prove completeness, consider the family of symmetric
densities given by
   f(x_1, ..., x_n) = C(θ_1, ..., θ_n) exp(-Σ_j x_j^{2n} + Σ_{k=1}^n θ_k Σ_j |x_j|^k).
                                                                   (3.5)

Let T_k = Σ_j |X_j|^k for k = 1, ..., n. Since the T_k determine the coefficients of
the polynomial g(x) = Π_{j=1}^n (x - |X_j|), the order statistics |X|_(j) are functions
of the T_k and hence so is Σ_j X_j^{2n}. If

   E[φ(|X|_(1), ..., |X|_(n))] = 0

under all distributions in (3.5), then

   ∫ ... ∫ φ(|X|_(1), ..., |X|_(n)) J(t_1, ..., t_n) exp(Σ_k θ_k t_k) dt_1 ... dt_n = 0

where the integration is over the region of possible values of T_1, ..., T_n and
where J(t_1, ..., t_n) is exp(-Σ_j x_j^{2n}) times the Jacobian of the |X|_(j) with
respect to the T_k. It follows by the theory of Laplace transforms that the integrand
is zero and hence that φ = 0. For more details on this approach see Lehmann
[1959, pp. 132-133]. Alternatively, see Fraser [1957b, pp. 28-31] or Lehmann
[1959, pp. 152-153] for an approach that is derived from the discrete case. These
sources deal with the usual order statistics X_(1), ..., X_(n) and arbitrary densities
rather than |X|_(1), ..., |X|_(n) and symmetric densities; this difference affects the
proof only slightly.   □

Matched Pairs

In a matched pairs experiment with, say, treatment and control observations on
each pair, it is usually assumed that the test procedure will depend only on the
treatment-control differences. Under this assumption, the foregoing results apply.
Similar results can be given which do not require this assumption and are more
in the spirit of matched pairs, referring to the structure of the situation. One such
result is described below.
   Let X_j′ and X_j″ denote the observations on the jth pair, and consider the null
hypothesis
   H_0*: The X_j′ and X_j″ are independently normal with common variance
σ² > 0 and arbitrary means μ_j′ and μ_j″ respectively. The treatment is assigned
randomly to one member of each pair and has no effect on the observed value
X_j′ or X_j″ of the treated unit.
It can be shown that if a test has level α for this null hypothesis, then it is a
conditional test given the X_j′ and X_j″, and relies only on the random assignment
of the treatment within pairs. The proof is quite similar to that given above for
H_0′; see the references cited there. Several remarks should be made, however.
   First, the null hypothesis H_0* sets up a normal model with completely arbitrary
unit effects and no treatment effect on any unit. The treatment-control difference
X_j for the jth pair is equally likely to be X_j′ - X_j″ or X_j″ - X_j′. Second, the
distribution of this treatment-control difference for the jth pair is not normal but
is a mixture of two normal distributions

with means Jlj - Jl'J and Jl'J - Jlj, and possibly extremely small variance.
Thus, it may, in a sense, approximate a two-point distribution Xj =
± (Jl~ - Jl'J) = ± vj, say. Third, the result holds for any null hypothesis
that contains Hci, Le., any null hypothesis including all the distributions in
Hci (Problem 26). Fourth, the result does not imply that the test is a ran-
domization test based solely on the treatment-control differences. One might,
for example, use a composite or "adaptive" test with components that
depend only on the treatment-control differences but a rule for choosing the
component test that depends on the sums within pairs (Problem 27).

4 Most Powerful Randomization Tests

4.1 General Case

Reasons for using randomization tests were given in Sect. 3.2. In this subsection
we will see how to find the randomization test which is most powerful against
any specified alternative distribution. The particular case of normal alternatives
will be illustrated in the two subsections following.
   We consider the usual (2^n-type) randomization distribution, under which,
given the absolute values |X_1|, ..., |X_n|, there are 2^n possible assignments of
signs, and hence 2^n possible samples, which are equally likely. The implication
of using a randomization test is that this condition holds under the null hypothesis.
Consider now an alternative distribution with joint density or discrete frequency
function f(x_1, ..., x_n). Under this alternative, given |X_1|, ..., |X_n|, the
conditional probabilities of each of the 2^n possible samples X_1, ..., X_n are
proportional to f(X_1, ..., X_n). By the Neyman-Pearson Lemma (Theorem 7.1 of
Chap. 1), it follows (Problem 28) that among tests with conditional level α given
|X_1|, ..., |X_n|, that is, among randomization tests, the conditional power against
f is maximized by a test of the form

   reject if f(X_1, ..., X_n) > k,
   "accept" if f(X_1, ..., X_n) < k.                               (4.1)

Randomization may be necessary at k (that is, the test may be randomized). The
choice of k and the randomization at k must be determined so that the test has
conditional level exactly α.
   In other words, the procedure for finding the conditionally most powerful
randomization test is to consider the value of f at each x_1, ..., x_n having the
same absolute values as the observed X_1, ..., X_n. The possible samples x_1, ...,
x_n are placed in the rejection region in decreasing order of their corresponding
values of f, starting with the largest value of f, until their null probabilities total α.
The region will consist of the α2^n possible samples x_1, ..., x_n which produce
the largest values of f if α2^n is an integer and if the (α2^n)th and (α2^n + 1)th
values of f are not tied. Ties may be broken arbitrarily. If α2^n is not an integer,
a randomized test will be necessary.
Since this test is the randomization test which maximizes the conditional power
against f, it also maximizes the ordinary (unconditional) power against f (Problem
28). Thus we have shown how to find the (2^n-type) randomization test which is
most powerful against any specified alternative f. The conditions given are
necessary and sufficient. The method for (n! 2^n)-type randomization tests is
similar.
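A sketch of this construction (Python; the function name is ours, and the boundary case
that would call for a randomized decision is merely flagged) enumerates the 2^n sign
assignments, evaluates the alternative density f at each, and rejects when the observed
sample is among the k assignments with the largest values of f. Any function proportional
to the density works equally well.

    from itertools import product
    from math import exp

    def most_powerful_randomization_test(x, f, k):
        """Level k/2**n randomization test that is most powerful against the
        alternative with joint density f.  Returns (reject, needs_randomization),
        where the latter flags a tie at the cutoff."""
        a = [abs(v) for v in x]
        values = [f([s * v for s, v in zip(signs, a)])
                  for signs in product((1, -1), repeat=len(a))]
        f_obs = f(list(x))
        n_greater = sum(v > f_obs for v in values)
        n_equal = sum(v == f_obs for v in values)
        if n_greater + n_equal <= k:
            return True, False        # observed value is clearly among the k largest
        if n_greater >= k:
            return False, False       # clearly outside the k largest
        return False, True            # tie straddles the cutoff; randomize

    # Illustrative alternative: i.i.d. normal with mean 1 and variance 1.
    def f_normal_mu1(v):
        return exp(-sum((vi - 1.0) ** 2 for vi in v) / 2.0)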

4.2 One-Sided Normal Alternatives

Consider now the alternative that X_1, ..., X_n are a random sample from a normal
distribution with mean μ and variance σ². Then the joint density is

   f(x_1, ..., x_n) = Π_{j=1}^n [1/(σ√(2π))] e^{-(x_j - μ)²/2σ²}.  (4.2)

For given |X_1|, ..., |X_n| (and μ and σ²), this f is an increasing function of Σ X_j
if μ > 0, and a decreasing function of Σ X_j if μ < 0. Therefore, rejecting if
f(X_1, ..., X_n) is one of the k largest of its possible values given |X_1|, ..., |X_n|
is equivalent to rejecting if Σ X_j (or an equivalent statistic) is one of the k largest
of its possible values given |X_1|, ..., |X_n| when μ > 0, and if it is one of the k
smallest when μ < 0. Thus the result of Sect. 4.1 shows that the upper-tailed
observation-randomization test based on S = Σ X_j (or X) is the most powerful
randomization test against any normal alternative with μ > 0, and similarly for
the lower-tailed test and μ < 0. In short, the one-tailed randomization tests based
on the sample mean are uniformly most powerful among randomization tests
against one-sided normal alternatives.

4.3 Two-Sided Normal Alternatives

Consider now the two-sided alternative that X_1, ..., X_n is a random sample
from a normal distribution with mean μ ≠ 0 and variance σ². Since we found in
the previous subsection that different randomization tests are uniformly most
powerful against μ < 0 and against μ > 0, there is no uniformly most powerful
randomization test against this two-sided alternative. However, the equal-tailed
randomization test based on X is the uniformly most powerful test against μ ≠ 0
among unbiased randomization tests (Problem 29). Further, in the class of
randomization tests which are invariant under the transformation carrying X_1,
..., X_n into -X_1, ..., -X_n, we show below that the equal-tailed randomization
test is again the uniformly most powerful test. This invariance means that the test
is unaffected if the signs of all the observations are reversed. Notice that this
transformation carries the alternative given by μ, σ² into that given by -μ, σ²,
so the invariance rationale
(Sect. 8, Chap. 3) can be applied. In particular, any invariant test has the same
power against μ, σ² as against -μ, σ².

PROOF. From the last sentence it follows that any invariant test has the same power
against μ, σ² as against the density h obtained by averaging the density for μ, σ²
and the corresponding density for -μ, σ², namely

   h(x_1, ..., x_n) = ½[f_μ(x_1, ..., x_n) + f_{-μ}(x_1, ..., x_n)],   (4.3)

where f_μ denotes the density (4.2). Accordingly, if we show that the equal-tailed
randomization test based on X is the most powerful randomization test against h,
it will follow that this is the most powerful invariant randomization test against
every μ, σ². By (4.2) we can write (4.3) as

   h(x_1, ..., x_n) = (2π)^{-n/2} σ^{-n} e^{-(Σ x_j² + nμ²)/2σ²}
                         × (e^{μ Σ x_j/σ²} + e^{-μ Σ x_j/σ²})/2
                    = (2π)^{-n/2} σ^{-n} e^{-(Σ x_j² + nμ²)/2σ²} cosh(μ Σ x_j/σ²).
                                                                   (4.4)

For fixed |x_1|, ..., |x_n|, (4.4) is an increasing function of |μ Σ x_j/σ²|, and hence
of |Σ x_j| (Problem 31). It follows from (4.1) that the most powerful randomization
test against h is that which rejects for large |Σ x_j|. Since this is simply the
equal-tailed randomization test based on X, the proof is complete.   □

The density h given in (4.3) may be interpreted as the marginal density of a sample
X_1, ..., X_n from a normal population if the population mean is, a priori, equally
likely to be μ or -μ. The proof uses this prior distribution only as a device, but
the invariance rationale itself breaks down if μ and -μ are not regarded in effect
as equally likely a priori. Notice also that h is not one of the alternatives originally
under discussion (but a mixture of two), and that h does not make X_1, ..., X_n
independent except conditionally. (In terms of the interpretation involving prior
probabilities, X_1, ..., X_n are conditionally independent given that the mean is
μ, or -μ, but they are not marginally independent.)

5 Observation-Randomization versus
Rank-Randomization Tests
We have found that the one-sample observation-randomization test of Sect. 2 is
valid with almost no distribution assumptions, and it is the most powerful
randomization test against normal alternatives. One can show that the performance
of this nonparametric test under normal alternatives is almost as good as that of
the corresponding most powerful parametric test whose validity is dependent upon
the normal distribution. Since the nonparametric test is completely valid under
much less restrictive assumptions, one could argue that there may be much to gain
and little to lose by using this nonparametric test. However, the Wilcoxon
signed-rank test is much more popular in applications than this observation-
randomization test (or any other one-sample nonparametric test), even though it
is only locally most powerful among signed-rank tests against alternatives which
are only close to normal.
In fact, observation-randomization tests are seldom used in practice,
while rank-randomization tests often are. Rank-randomization tests are
probably used primarily because they are so convenient in practical applica-
tions. The observation-randomization test criteria of Sect. 2 depend heavily
on the actual values of the |X_j|. Accordingly, the general one-sample
randomization distribution cannot be tabled for arbitrary samples of fixed
size. However, the randomization distribution of a test criterion based on
signed ranks is constant for all samples of fixed size for which there are no
ties among the absolute values. Thus the null distribution can be tabled for
any rank-randomization (signed-rank) test. Further, particularly for some
of them, confidence interval procedures corresponding to rank-randomiza-
tion tests are easier to use.
The convenience of rank-randomization tests is not so important with the
availability of computers. A computer program could be used to generate
or simulate the complete general randomization distribution for any set
of data, and hence the distribution of an observation-randomization test
statistic or confidence bound. The cost of a little extra effort in data analysis
is small compared to the other costs of many experiments.
Besides convenience, however, rank-randomization tests frequently have
good performance properties. They may be highly efficient, and more
powerful than observation-randomization tests against many alternatives
(see Sect. 8, Chap. 8). They are less sensitive to outliers. Further, as mentioned
in Sect. 8, Chap. 3, in most situations little is lost by treating alike practically
all samples which have the same signed ranks. These are perhaps sufficient
reasons for using rank-randomization rather than observation-randomiza-
tion procedures in most situations.
Observation-randomization tests are nevertheless of considerable theo-
retical and historical interest. They provide important examples of the basic
spirit of general nonparametric statistics in that the exact probability dis-
tribution can always be found in the null case without making specific
assumptions about the popUlation and a strong type of optimality can be
achieved under normality. Even more important historically is that observa-
tion-randomization tests provided the seed of the basic idea for rank-
randomization tests and hence may have been instrumental in their develop-
ment since 1935.

PROBLEMS

1. (a) Express X, S, S+, and S- each in terms of each of the others and Σ_j |X_j|.
(b) Show that the randomization tests based on these statistics are equivalent.

2. (a) Find the one-tailed P-value of the Wilcoxon signed-rank test for Manis' data
given in Table 2.1.
(b) For Manis' data, find lower and upper confidence bounds corresponding to
the Wilcoxon signed-rank test at the one-sided level α = 10/1024 for an assumed
center of symmetry.
(c) Do (b) for the randomization test based on X.
(d) How would you interpret the confidence bounds in (b) or (c)?
(e) In Manis' data, some subjects had smaller initial "distance" than others, and
hence less possibility of reducing the distance over time. How does this affect the
various randomization tests of the null hypothesis of symmetry about O? What
alternative procedures might be considered, with what advantages and dis-
advantages?

3. For the data in Sect. 3.1, Chap. 3, Fisher [1966 and earlier editions] obtained the
one-tailed P-value of 0.02634 for a randomization test based on the t statistic. How
many values of t as small as or smaller than that observed must he have counted?
4. Show that the randomization distributions of X, S, S+, S-, and t are all symmetric.

5. Consider the null hypothesis H_0 that the observations X_1, ..., X_n are independently
distributed, symmetrically about 0, and the randomization test that rejects H_0 when
K ≤ k for some chosen integer k, where K is the number of different assignments of
signs for which Σ ±X_j ≥ Σ X_j (including Σ X_j itself). Show that
(a) This test has level α ≤ k/2^n, irrespective of whether the observations are identically
distributed. The P-value is K/2^n.
(b) This test has level α = k/2^n if the observations are also continuously distributed.
6. Consider a matched-pairs experiment with the null hypothesis of no treatment
effect on any observation. Show that randomization within pairs induces the
randomization distribution of the treatment-control differences whatever the
paired observations may be.
7. Show that the confidence bounds corresponding to the upper-tailed randomization
test based on X are
(a) X_(1) at level 1/2^n;
(b) (X_(1) + X_(2))/2 at level 2/2^n;
(c) min{X_(3), (X_(1) + X_(2))/2} at level 3/2^n,
where X_(i) is the ith order statistic of a sample of size n. Compare Problem 36,
Chap. 3.

*8. Given sample observations X_1, ..., X_n, let K be the number of different assignments
of signs for which Σ ±X_j ≥ Σ X_j (including Σ X_j itself). Show that
(a) The number of nonnegative subsample means is K - 1. (Hint: A subsample total
equals (Σ X_j - Σ ±X_j)/2 where a - is assigned if X_j is in the subsample and a +
is assigned otherwise.)
(b) The confidence bound for center of symmetry μ corresponding to the randomization
test that rejects for K ≤ k is the kth smallest subsample mean.
(c) The test that rejects for K ≤ k has level α ≤ k/2^n if the X_j are independent and
symmetric about 0 (not necessarily identically distributed). The P-value is K/2^n. If
the X_j are also continuously distributed, then α = k/2^n.
(d) If the X_j are independent with a distribution that is continuous and symmetric
about μ, then the 2^n - 1 subsample means all differ with probability one, and they
partition the real line into 2^n intervals that are equally likely to contain μ.
*9. (Randomization tests with randomly chosen subsets of assignments of signs)
Let X_1, ..., X_n be independent observations with a distribution that is symmetric
about 0 under the null hypothesis H_0. Define Y_0 = Σ_j X_j and Y_i = Σ ±X_j =
Σ ±|X_j| for i = 1, ..., m, where the signs are drawn at random either (i) unrestrictedly,
or (ii) without replacement from all 2^n possible assignments except that corresponding
to Y_0. Let K denote the number of values of i for which Y_i ≥ Y_0 for 0 ≤ i ≤ m, let
R_i denote the set of values of j for which the sign of X_j differs in the sums for Y_0
and Y_i, and let Z_i denote the mean of the X_j for which j is in R_i, for R_i nonempty.
Show that
(a) The two sums given above to define Y_i are equivalent.
(b) Y_0 is a random sample of size one from the order statistics of Y_0, Y_1, ..., Y_m
under H_0.
(c) The test that rejects H_0 if K ≤ k has level α ≤ k/(m + 1) and P-value equal to
K/(m + 1).
(d) The R_i, i = 1, ..., m, are a random sample drawn with replacement from all 2^n
possible subsets of the set {1, 2, ..., n} in case (i) above, and without replacement from
all 2^n - 1 nonempty subsets in case (ii) above.
(e) If H is the number of nonnegative Z_i and E is the number of empty R_i for
1 ≤ i ≤ m, then K = H + 1 + E. In case (ii) above, E = 0.
(f) The confidence limit for center of symmetry μ that corresponds to the test that
rejects for K ≤ k is the (k - E)th smallest Z_i for 1 ≤ i ≤ m (which is the kth smallest
in case (ii) above).
(g) If the X_j have a continuous distribution that is symmetric about μ, then in case
(ii) the level of the test in (c) is α = k/(m + 1) and the Z_i partition the real line into
m + 1 intervals that are equally likely to contain μ. (This result can also be proved
from Problem 8d using Problem 10 below.)
*10. Let W_1 < ... < W_N be continuously distributed with P(W_i > 0) = i/(N + 1) for
1 ≤ i ≤ N. Let W'_1 < ... < W'_m be the order statistics of a sample of m drawn without
replacement from W_1, ..., W_N. Show that P(W'_i > 0) = i/(m + 1) for 1 ≤ i ≤ m.
(Hint: One method is to consider the case where the W_i are obtained by subtracting
from the order statistics of a sample an additional observation from the same
distribution.)
11. (a) Find the center of symmetry and the variance of the randomization distributions
of each of X, S, S+, S-, and t.
(b) How are the moments of X, S, S+, and S- related? Why do the moments of t not
have a simple relationship to these moments?
12. Let G be the set of all n-dimensional vectors J = (J_1, ..., J_n) of 1's and -1's such
that an even number of elements are equal to -1. Let G_X be the set defined in procedure
(b) of Sect. 2.5. Show that
(a) G has 2^{n-1} members.
(b) If no X_j = 0, then G_X also has 2^{n-1} members.
(c) If J and J' both belong to the set G, then the vector (J_1J'_1, ..., J_nJ'_n) also
belongs to G.
(d) Given that X belongs to a particular set G_Y, all members of G_Y are equally likely
under the randomization distribution. What if X does not belong to G_Y?
*13. Let G be the set of all n-dimensional vectors of 1's and -1's such that the first two
elements are not both equal to -1. Let G_X be the set defined in procedure (b) of
Sect. 2.5. Show that
(a) G has 3(2^{n-2}) members. (G does not satisfy Problem 12c.)
(b) X = (X_1, ..., X_n) has the smallest or second smallest mean among members of
G_X if all X_j > 0 except possibly X_1 or X_2 but not both.
(c) If no X_j = 0, then under the randomization distribution the probability is at least
3/2^n that X has the smallest or second smallest mean among members of G_X. (If X
could be treated as a random member of G_X, the probability would be (8/3)/2^n,
which is always smaller than 3/2^n.)
*14. (a) Show that the confidence bound corresponding to a randomization test based
on X with a randomization set restricted by means of a group G as described in
procedure (b) of Sect. 2.5 is the kth smallest or largest subsample mean among
subsamples corresponding to members of G, where k is the critical value of the
randomization test and the subsample corresponding to a vector J in G consists of
those X_j for which J_j = -1.
(b) What operation on the subsamples corresponds to the group multiplication in G?
15. (a) Make a small table to compare the values of z_α[(n - 1)/(n - z_α²)]^{1/2}, t_α,
and z_α, where z_α is the upper α quantile of the standard normal distribution and t_α
is that of the Student's t distribution with n - 1 degrees of freedom. (A good picture
can be obtained from the values α = 0.10, 0.05, 0.025, 0.01; n = 3, 6, 10, 20.)
*(b) How do the values of t_α, z_α[(n - 1)/(n - z_α²)]^{1/2}, and z_α compare for large
sample sizes?
(c) Find the ranges of the standardized randomization test statistic in (2.4) and the t
statistic in (2.7), given the sample absolute values. Find the ranges unconditionally.
(d) What do the ranges found in (c) imply about the normal approximation (2.4) to
the randomization test?
*(e) Show that the values of α for which t_α - z_α[(n - 1)/(n - z_α²)]^{1/2} is of
smaller order than 1/n as n → ∞ are the α values such that z_α = 0, ±√3.
16. (a) Find the moments of the randomization distribution of t² and compare them
with the corresponding moments of the F distribution with 1 and (n - 1) degrees of
freedom.
(b) Find the moments of the randomization distribution of t²/(t² + n - 1) and compare
them with the corresponding moments of the beta distribution with parameters 1/2 and
(n - 1)/2.

17. (a) Show that the expression for d in (2.10) is equivalent to the expression in (2.9).
(b) Show that b_2, as defined by (2.11), is a consistent estimator of the kurtosis of the
distribution of the X_j under the null hypothesis of a distribution symmetric about
zero.

*18. (a) Derive the first four moments of the randomization distribution of the statistic
Z defined by (2.4).
(b) Find the parameters of the beta distribution whose first two moments are the same
as the corresponding moments of the randomization distribution of Z²/n.
(c) Show that approximating the randomization distribution of Z²/n by the beta
distribution in (b) is equivalent to approximating the randomization distribution of
t² = (n - 1)Z²/(n - Z²) by an F distribution and find the degrees of freedom of this
F distribution.
(d) Let V have the F distribution with degrees of freedom 1 and k. Find the values of
c and k such that c²V has the same first two moments as the F distribution in (c).

19. Apply the approximations of (c) in Sect. 2.5 to the Darwin data in Sect. 3.1 of
Chap. 3 to verify the results given in the text.
*20. (a) Give at least two possible definitions of convergence in distribution for
conditional distributions.
(b) Show that the definitions you gave in (a) imply the usual convergence in distribution
of the marginal distributions.
(c) Show that the converse of (b) does not hold.
(d) Show that the ordinary central limit theorems do not apply to these definitions.
21. Show that a randomization test at level α has unconditional level α under the null
hypothesis H_0 of a distribution symmetric about zero if the observations are
independent and identically distributed.
22. Show explicitly, in terms of the expectation of the critical function φ under
randomization, that
(a) Any 2^n-type randomization test is an (n! 2^n)-type randomization test, but not
conversely.
(b) A permutation-invariant test has the same level under both types of randomization.
*23. Show that for the null hypothesis H_0″: X_1, ..., X_n are independent with a
common distribution symmetric about zero, if a test is unbiased against the alternative
H_1″: X_1, ..., X_n are independent with a common distribution that is symmetric
about some point μ ≠ 0 (or μ > 0), then that test is an (n! 2^n)-type randomization
test. (Remember that the common distribution need not be continuous.)
*24. Show that, in Problem 23, if unbiasedness is required only against continuous
alternatives, then the test need not be a randomization test for all discrete dis-
tributions. Why does this result not really contradict the results given in Sect. 3.2?
25. Show that the order statistics of the absolute values of the observations are sufficient
statistics for a sample from an arbitrary distribution that is symmetric about zero.
*26. In a matched pairs experiment, show that if a test has level α for a null hypothesis
containing H_0* as stated in Sect. 3.2, then this test is a conditional test given the
observations X_j′ and X_j″.
27. In a matched pairs experiment, give an example of a test which is conditional on
the observations X_j′ and X_j″ but does not depend only on the treatment-control
differences. Why might such a test be desirable?

28. Show that, both conditionally and unconditionally, the most powerful randomization
test against an alternative density or discrete frequency function f(x_1, ..., x_n) is of
the form given in (4.1).
*29. (a) Show that the equal-tailed randomization test based on X is a uniformly most
powerful unbiased randomization test against the alternative that the observations are
a random sample from a normal distribution with mean μ ≠ 0 and variance σ².
(b) Using the fact that the test in (a) is uniformly most powerful among tests which
are invariant under reversal of signs, show that it is the most stringent randomization
test against the same alternative.
*30. Show in general that under appropriate conditions
(a) If a uniformly most powerful invariant test and a uniformly most powerful
unbiased test both exist, then they must be the same test.
(b) If a uniformly most powerful invariant test exists, then it is most stringent.
31. Show that, for fixed |x_1|, ..., |x_n|, the function h(x_1, ..., x_n) given in (4.4) is
an increasing function of |Σ x_j|.
32. Let X_1, ..., X_n be independent and identically distributed with density f(x) =
(1/σ)exp{-ρ([x - θ]/σ)} where ρ is a known symmetric function and σ is a known
scale factor. Let the null hypothesis be H_0: θ = 0.
(a) Find the most powerful randomization test of H_0 against the alternative H_1:
θ = θ_1 for θ_1 specified.
*(b) Under what circumstances does there exist a uniformly most powerful
randomization test against the alternative H_1′: θ > 0?
(c) Show that the (locally) most powerful randomization test against the alternative
H_1′: θ > 0 for small θ can be based on the statistic Σ_j ψ(x_j/σ) where ψ(x) = ρ′(x)
= dρ(x)/dx. (See Sect. 9, Chap. 3.)
(d) Show that the tests in (a) and (c) are valid even if the σ and ρ assumed are incorrect.
(e) Show that the maximum likelihood estimate θ̂ of θ satisfies Σ_j ψ([x_j - θ̂]/σ) = 0
where ψ(x) = ρ′(x). (Estimates of this form appear significantly in work of Edgeworth
[1908-9] and Fisher [1935] and were named M-estimates in Huber [1964], which
studies them extensively.)
(f) Show that the estimate that corresponds (in a suitable sense) to the test in (c) is the
maximum likelihood estimate.
33. With reference to Problem 32,
(a) If σ is unknown, how could it be estimated without affecting the validity of the
randomization tests in (a) and (c)?
(b) If in addition ρ is not fully known but has some unknown shape parameters, how
could they be estimated without affecting the validity of the randomization tests in (a)
and (c)?
(c) What estimates would correspond to the tests in (a) and (b)?
CHAPTER 5
Two-Sample Rank Procedures
for Location

1 Introduction
The previous chapters dealt with inference procedures applicable in one-
sample (or paired-sample) problems. We now consider the situation where
there are two mutually independent random samples, one from each of two
populations. We discuss tests which apply to the null hypothesis that the
two populations are identical and the confidence procedures related to these
tests.
In choosing an appropriate test from among those available for the null
hypothesis of identical populations, consideration should be given to the
alternative hypothesis, since different tests are sensitive to different alter-
natives. The alternatives may be simply that the two populations differ in
some unspecified way, but frequently some specific type of difference is of
particular interest. A general alternative which is frequently important is
that the observations from one population tend to be larger than the observa-
tions from the other ("stochastic dominance"). A particular case of this
relationship occurs when the populations satisfy the shift assumption, which
is explained explicitly in the next section. Frequently, the difference in
"location" between the two populations is of primary interest. Under the
shift assumption, this difference is the same whatever location parameter is
chosen, and is the amount of shift required to make the two populations
identical. Furthermore, we can develop confidence procedures (correspond-
ing to the test procedures) which give confidence intervals for this difference
in location, or shift.
The primary discussion of two-sample tests in this book is divided among
three chapters. The median test, tests based on rank sums, and more general


rank-randomization tests are discussed in this chapter, observation-ran-


domization tests in Chap. 6, and Kolmogorov-Smirnov tests in Chap. 7.
Chapter 8 discusses asymptotic relative efficiency generally, and specifically
for both one and two-sample procedures.
Once the tests have been developed, it will be evident that they retain
their significance levels and hence remain valid even when the basic assump-
tions are relaxed in certain ways. For instance, even independence of the
observations is not required for tests of no treatment effect as long as the
assignment to treatment or control groups is appropriately random. While
these two-sample procedures do not require normal populations, they do not
provide protection against inequality of variance, or other differences in
shape, even in the null case.

2 The Shift Assumption


Suppose that X_1, ..., X_m and Y_1, ..., Y_n are mutually independent sets of
observations drawn from two populations. We say that the Y population
is the same as the X population except for a shift (the shift assumption) by
the amount μ if

P(X ≤ x) = P(Y ≤ x + μ)    for all x.    (2.1)

This condition is equivalent to saying that X and Y − μ have the same
distribution, or that Y is distributed as X + μ. In terms of c.d.f.'s, this relation
is

F(x) = G(x + μ)    for all x,    (2.2)

where F and G are the cumulative distribution functions of X and Y respec-


tively. Under the shift assumption, the cumulative distribution function of
the Y population is the same as that of the X population but shifted to the
left if μ < 0, and to the right if μ > 0, as in Fig. 2.1 (a) and (b) respectively.
If X and Y have densities f and g, then (2.1) or (2.2) is equivalent to

f(x) = g(x + μ)    for all x.    (2.3)

Two arbitrary density functions which satisfy (2.3) are shown in Fig. 2.1 (c).
The shift assumption means that the two populations have the same
shape, and in particular their variances must be equal if they exist (Problem
1). Two normal populations with the same variance satisfy the shift assump-
tion, but two normal populations with different variances do not, nor do two
Poisson or exponential populations with different parameters.
If the shift assumption holds, then μ, the amount of the shift, must equal
the difference between the two population medians. It must also equal the

Figure 2.1 (a) F(x) = G(x + μ), F normal, μ < 0. (b) F(x) = G(x + μ), F exponential,
μ > 0. (c) f(x) = g(x + μ), μ < 0.

difference between the two population means, and indeed the difference
between the two values of any other location parameter (if it exists), such
as the mode, the midrange, or the average of the lower and upper quartiles.
The mean need not equal the median (or any other location parameter) in
either population, but the difference between the mean and the median must
be the same for both populations. Since the populations are the same except
for location, the difference in location is the same however it is measured,
and it equals the shift μ (Problem 2). For this reason, the shift parameter is
sometimes called the location parameter.
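This property is easy to check numerically. The following minimal Python sketch (purely illustrative; the exponential population and sample size are arbitrary choices, not from the text) draws samples from a skewed population and from the same population shifted by μ, and confirms that the difference in sample means and the difference in sample medians both estimate the same shift.

```python
import random
import statistics

# Illustrative check of the shift assumption: Y is distributed as X + mu,
# so any measure of location differs between the populations by exactly mu.
random.seed(1)
mu = 2.0
x = [random.expovariate(1.0) for _ in range(100_000)]        # skewed X sample
y = [random.expovariate(1.0) + mu for _ in range(100_000)]   # Y sample, shifted by mu

print(statistics.mean(y) - statistics.mean(x))       # close to mu = 2.0
print(statistics.median(y) - statistics.median(x))   # also close to mu = 2.0
```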
The confidence procedures in this chapter are developed under the shift
assumption, and accordingly they provide confidence intervals for μ, the
amount of the shift. If the shift assumption fails badly, the procedures will
not perform as advertised since the confidence level will not ordinarily be
valid.
On the other hand, the test procedures here can be developed and justified
logically without assuming that the shift assumption (or any other relation-
ship between the distributions) holds under the alternative hypothesis. The

tests retain their level as long as F = G under the null hypothesis. Thus,
while" acceptance" of this null hypothesis may be interpreted as not rejecting
the possibility that the shift assumption holds with fl = 0, rejection of the
null hypothesis implies no inference about whether the shift assumption
holds for any other fl. Furthermore, the tests developed here appear as if
they would be good against alternatives which include more than shifts,
and certain mathematical properties to be discussed provide justification for
this view. Of course similar statements apply to parametric tests of a dif-
ference in location for two otherwise identical populations, including the
normal theory test for equal means, so these points are not new or special to
nonparametric tests.

3 The Median Test, Other Two-Sample Sign Tests,


and Related Confidence Procedures
The sign test discussed in Chap. 2 for the one-sample or paired-sample
problem can be adapted to the two-sample problem in various ways to test
the null hypothesis of identical distributions. Tests of this nature, which may
be called two-sample sign tests, are covered in this section. The data relevant
for a two-sample sign test can be presented in a 2 x 2 table whose entries are
observed sample frequencies. The test is then carried out by Fisher's exact
test, the general method appropriate for such data. One especially important
two-sample sign test is usually called the median test; this test and its cor-
responding confidence procedures will be discussed in particular detail
here.

3.1 Reduction of Data to a 2 x 2 Table

Suppose we have a set of m X observations and a set of n Y observations, and


hence a total of m + n = N observations in the two sets combined. The data
can be summarized by a 2 x 2 table if all the observations are dichotomized.
Consider the following two methods of dichotomizing the observations
according to size:
(1) A particular number ξ is selected before looking at the data, and each
observation is classified as being either "below ξ" or "above ξ." (For definite-
ness, we define "below" as "strictly less than" and "above" as "greater than
or equal to." This ensures a unique and exhaustive dichotomy.)
(2) A particular integer t is selected and the t smallest observations in
both sets combined are classified as "below," the rest as "above." (This
definition is incomplete in certain cases of ties; see Sect. 3.3.)
In either case, the dichotomized sample data can be presented in a 2 x 2
table like that shown in Table 3.1. Note that the numbers A and B determine

Table 3.1
            X's      Y's
Below       A        B        t
Above       m − A    n − B    N − t
            m        n        N

the entire table, since the column totals, m and n, are fixed by the sample
sizes. In case (2), since t is also fixed by the procedure, A alone determines the
entire table.
In case (1), A and B are determined by simply comparing each observation
with the fixed ~. In case (2), the most straightforward procedure is to combine
both sets of observations into a single array but keep track of which observa-
tions are X's and which are Y's. Then A and B are the number of X's and Y's
respectively among the t smallest in the single combined array.
For any given data set, the same 2 x 2 tables can be obtained by method
(1) with various choices of ~ and by method (2) with various choices of t.
Specifically, the t in a table obtained by method (1) gives the same table by
method (2). Further, any method (2) table is also given by method (1) if ξ
is selected in such a way that exactly t observations are "below ξ," that is,
if ξ is any number greater than the (t)th smallest observation and smaller
than or equal to the (t + 1)th smallest observation. (If no such ξ exists
because of ties, then t is not a possible value by either method.) Note that
method (2) does not require any consideration of ξ at all, however.
For example, consider the two samples¹ below where m = 10 and n = 12.

X_i: 13.8, 24.5, 20.7, 22.5, 26.5, 14.5, 6.4, 20.0, 17.1, 15.5

Y_j: 16.2, 23.9, 24.3, 17.8, 15.7, 14.9, 6.1, 11.1, 16.5, 17.9, 15.3, 14.3

In case (1), if we take ξ = 18.0, we find that 5 of the X's and 10 of the Y's are
less than 18.0 so that A = 5 and B = 10; this gives Table 3.2. For case (2),

¹ These data are from Mosteller, F. and D. Wallace [1964, Sect. 4.8 at pp. 174-175], Inference
and Disputed Authorship: The Federalist, Addison-Wesley Publishing Co., Reading, Mass.
The Y's are scores computed in a certain way for the 12 "Federalist Papers" whose authorship
is in dispute between Hamilton and Madison. More specifically, Y_j is the natural logarithm of
the odds provided by the data in favor of Madison's having written the jth disputed paper,
under certain assumptions about the underlying model, except that it has been adjusted to allow
for the length of the paper. The X's are scores computed in the same way for 10 segments of
about 2,000 words each taken from material known to be by Madison. With the adjustment
for length, the X's and Y's should come from approximately the same population if the model is
reasonably good and if the disputed papers are by Madison. If the X's and Y's are not from the
same population, this by no means indicates that the disputed papers are by Hamilton; the
Y's are vastly different from the scores for Federalist Papers known to be by Hamilton. The
indication would be rather that something remains to be explained, perhaps an inadequacy in
the model. The adequacy of the model is explored extensively by Mosteller and Wallace.

Table 3.2
              X's     Y's
< 18.0          5      10      15
≥ 18.0          5       2       7
               10      12      22

Table 3.3
              X's     Y's
Below           3       5       8
Above           7       7      14
               10      12      22

if we take t = 8, we find that, of the eight smallest observations in both
sets combined, 3 are X's and 5 are Y's; this gives Table 3.3. The same table
would be obtained by method (1) for any ξ in the interval 15.3 < ξ ≤ 15.5.
If the X's and Y's have the same distribution, then for any number
ξ, we have P(X < ξ) = P(Y < ξ). This of course is equivalent to saying that
for some fraction p, ξ is a quantile of the same order p in both populations.
For method (1), where ξ is a preselected constant, it follows that A and B,
the respective numbers of "successes" (observations less than ξ), are inde-
pendently binomially distributed with the same parameter p = P(X < ξ) =
P(Y < ξ). The standard test for the equality of proportions then provides
a test of identical distributions. When we apply the standard test for equality
of proportions to a table obtained by method (1), we call it the two-sample
sign test with fixed ξ. The standard test for equality of proportions (Fisher's
exact test or an approximation) is conditional on the observed value of
A + B = t. Under this conditioning, the marginal totals in Table 3.1, that
is, m, n, t, and N - t, are all fixed. Hence once one of A, B, C = m - A, and
D = n - B is known, the others are determined, and the test may as well be
based on the conditional distribution of say A given t.
In the two-sample sign test with fixed ξ, the dichotomizing point ξ is a
constant chosen without knowledge of the observations. An unfortunate
choice of ξ might produce a test with very low power. For example, if ξ is
chosen too small, then most of the observations are likely to be above ξ;
this makes t small and leaves A with a small range of possible values, which
suggests that A will not be a powerful test statistic.
If we do not require that ξ be chosen in advance, a natural choice for ξ
is some particular quantile of the combined sample of X's and Y's. As long
as there are no tied observations, such a choice fixes t rather than ξ (as in
method (2) above), since a sample quantile is an order statistic of fixed
order. However, as noted previously, fixing ξ leads eventually to conditioning
on t, albeit at a value depending on the data. If we are going to condition on
t eventually anyway, might we not fix t initially as well?
A test based on A for preselected t (method (2)) is called a quantile test.
When t is selected so that the observations are dichotomized at the combined
sample median, the test is called the median test. Notice that the marginal
totals in Table 3.1 are again fixed, so that any cell determines the entire table.
Furthermore, the null distribution of A given t is the same as before. (Indeed,
this conditional null distribution applies also when the choice of t (or ξ) is

not made in advance but is based on the combined sample, as will be dis-
cussed.)
Specifically, if there are no ties at the combined sample median, the median
test for N even is equivalent to always choosing the value t = N/2; for N odd
it is equivalent to t = (N − 1)/2 if the combined sample median is counted
as "above" rather than "below," as it is by our earlier, arbitrary convention.
Ties will be discussed in Sect. 3.3.
As an example, we develop the 2 x 2 table that arises when the median
test is applied to the Mosteller and Wallace data given earlier in this section.
Since N = 22, we choose t = 11. The smallest t = 11 observations in the
combined sample include four X's and seven Y's, as Table 3.4 shows. Fisher's
exact test or an approximation may be applied to this 2 x 2 table. We need
not consider the combined sample median explicitly, but it is any number
between the eleventh and twelfth observations in the ordered pooled sample.
These observations are 16.2 and 16.5 respectively. Dichotomizing at say
16.4 leads to Table 3.4, as would dichotomizing at any other ξ in the interval
16.2 < ξ ≤ 16.5.
The two-sample sign test for fixed ξ is not a rank test because it does not
depend only on the ranks of the two samples. This test would be appropriate
to use when the measurement scale has only a small number of possible
(or likely) values, since then it is natural to choose ξ equal to the central
value expected.
The two-sample median and other quantile tests are particularly useful
in analyzing data related to experiments involving life testing because they
permit termination of the experiment before all units under test have ex-
pired. The information needed to perform the test is complete once t units
have expired, and sometimes well before that (Problem 5). The control
median test and the first-median test are variants of the two-sample quantile
tests with particular forms of termination rules. These variants reach the
same decisions as a two-sample quantile test (usually earlier) and hence
coincide if sampling is terminated as soon as a decision is reached, except
that the two tails may correspond to different two-sample quantile tests
(Problems 6 and 7). Rosenbaum [1954] gives a test which is equivalent to a
special case of a two-sample quantile test since it is based on the number of
observations in one sample that are larger than the largest value in the other
sample.

Table 3.4
                               X's     Y's
Below combined median            4       7      11
Above combined median            6       5      11
                                10      12      22

3.2 Fisher's Exact Test for 2 x 2 Tables

The general test known as Fisher's exact test provides a method for analyzing
2 x 2 tables like Table 3.1 that arise in two-sample sign tests. In Fisher's
test, the marginal totals m, n, t, and N - t are all fixed, either initially or after
conditioning, and the test is based on the conditional distribution of A
given t.
Assume that the X's and Y's that produced Table 3.1 are independent.
Under the null hypothesis that the X's and Y's have the same distribution,
the data in Table 3.1 represent N independent observations from a single
population. As long as either ξ or t is preselected, for any given t, m, and n, all
subsets of size t are equally likely to be the subset containing the t smallest
observations, and hence any set of t observations out of the N is as likely as
any other set of t to constitute the observations in the first row of Table 3.1.
It follows that the conditional distribution of A given t is the hypergeometric
distribution (Problem 8), with discrete frequency function given by

f(A | m, n, t) = (m choose A)(n choose t − A) / (N choose t).    (3.1)

This is true, regardless of underlying distributions, if a given set of N


units is separated into two samples by randomization (Problem 8).
The test based on this null distribution is referred to generally as Fisher's
exact test. Other tests, such as the chi-square test for the equality of two
proportions, are really approximations to the one based on (3.1). Fisher's
exact test is appropriately applied in situations which may arise in the fol-
lowing three conceptually different ways.2
(1) (All marginal totals fixed initially). The margins may all be fixed
by the rules leading to the table, as in a quantile test, such as the median
test, where m and n are the sample sizes and the dichotomization is made so
that t will have a certain value.
(2) (Equality of proportions). A and B are binomial random variables
based on samples of size m and n respectively, and each has the same param-
eter p under the null hypothesis. Then conditioning on A + B = t fixes all
the marginal totals. This situation includes the two-sample sign test with
fixed ξ, introduced in Sect. 3.1.
(3) (Double dichotomy). A single sample of N observations may be di-
chotomized in two distinct ways (for example, sex and employment status),
the two dichotomies being independent under the null hypothesis. Then all
the marginal totals are random variables, but conditioning on m and t
fixes all the margins. This case does not arise in two-sample tests.
In our situation, as long as the choice of t in (1), or the dichotomizing
point ξ in (2), depends only on the combined sample observations and not

² There are, however, 2 x 2 tables for which Fisher's exact test is not appropriate. (See Sect. 8,
Chap. 2 for examples.)

on which observations are X's and which are Y's, the null distribution of A
given t is given by (3.1).
The one-tailed P-value of Fisher's exact test is the cumulative probability
in (3.1) of the observed A or less for the left tail, and the observed A or more
for the right tail; these are tail probabilities in the hypergeometric distribu-
tion.
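The hypergeometric tail probability in (3.1) is simple to evaluate directly. The Python sketch below (the helper names are illustrative) does so with exact binomial coefficients; applied to Table 3.4, with m = 10, n = 12, t = 11 and A = 4, it gives the one-tailed P-value 0.335 cited in the next paragraph.

```python
from math import comb

def hypergeom_pmf(a, m, n, t):
    """P(A = a) from (3.1): a X's among the t smallest of m X's and n Y's."""
    return comb(m, a) * comb(n, t - a) / comb(m + n, t)

def lower_tail_p(a_obs, m, n, t):
    """One-tailed P-value: P(A <= a_obs) under the hypergeometric null."""
    lo = max(0, t - n)                       # smallest possible value of A
    return sum(hypergeom_pmf(a, m, n, t) for a in range(lo, a_obs + 1))

print(round(lower_tail_p(4, m=10, n=12, t=11), 3))   # 0.335, matching Table E
```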
Tables of the hypergeometric distribution are available [for instance,
Lieberman and Owen, 1961], but they are necessarily bulky. In order to
perform the median test in the absence of ties, only one value of t is required
for each combination of m and n, so that more convenient tables are possible.
Table E is designed for use with the median test for t = N/2 if N is even and
t = (N ± 1)/2 if N is odd, when m ≤ n. For example, it applies to Table 3.4
and gives a P-value for A ≤ 4 of 0.335. If m > n the designations X and Y
can be interchanged so that A still represents the observed number "below"
in the smaller sized sample. This is actually equivalent to basing the test on
B instead of A, with large B corresponding to small A. If other values of t
are required, as when there are ties at the combined sample median or
naturally dichotomous observations (see Sect. 3.3), Table E cannot be used.
Notice that A is symmetrically distributed for t = N/2.
In the absence of tables or outside the range of available tables, we must
use a computer program or approximations to the null distribution. The
most common approximation is based on Z², the chi-square statistic cor-
rected for continuity, or on its signed square root, Z, which is approximately
the normal deviate corresponding to the one-tailed P-value and can there-
fore be referred to Table A. An advantage of Z is that it reflects the direction
of the sample difference; Z² masks this direction, and hence can only be
used for two-tailed tests. Formulas for Z are

Z = (A ± 1/2 − mt/N)[N³/(mnt(N − t))]^{1/2}                        (3.2a)

  = [A(n − B) − (m − A)B ± N/2][N/(mnt(N − t))]^{1/2},             (3.2b)

where the ± term represents the continuity correction and is to be taken
as + if A is included in the lower tail and − if in the upper tail. For the data
of Table 3.2, Equation (3.2b) gives

Z = [5(2) − 5(10) + 11][22/(10(12)(15)(7))]^{1/2} = −1.212,

corresponding to a normal distribution tail probability of 0.113 from Table A.


The correct value is also 0.113, but such close agreement is unusual. Fisher
and Yates [1963, Table VIII] give the critical values of Z for the exact test
at α = 0.025 and 0.005. These values indicate that the normal approximation

is quite accurate in the case of the median test provided the smaller sample
size is at least 12. If m/N and t/N are both far from 1/2, however, the approxima-
tion is not very accurate for one-tailed probabilities.
The test based on chi-square is popularly known as "the" test for 2 x 2
tables, but it is really just an approximation to Fisher's exact test. The exact
test seems to be less frequently used, probably because tables of the chi-
square distribution are much more accessible and sample sizes are frequently
large anyway.
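For reference, the worked example above is easy to reproduce by machine. The sketch below (illustrative code) evaluates Z from (3.2b) for Table 3.2 and converts it to an approximate one-tailed probability with the standard normal c.d.f., giving Z ≈ −1.212 and a tail probability of about 0.113.

```python
from math import erf, sqrt

def z_from_table(A, B, m, n, t, lower_tail=True):
    """Continuity-corrected Z of (3.2b) for a 2 x 2 table with margins m, n, t."""
    N = m + n
    cc = N / 2 if lower_tail else -N / 2      # continuity correction term
    return (A * (n - B) - (m - A) * B + cc) * sqrt(N / (m * n * t * (N - t)))

def std_normal_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

z = z_from_table(A=5, B=10, m=10, n=12, t=15)
print(round(z, 3), round(std_normal_cdf(z), 3))   # -1.212 and 0.113
```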
*Several binomial approximations are also available, but they require
the use of binomial tables which are themselves limited by having two more
parameters than normal tables. They work best when the table is rearranged
so that the two margins opposite A are the two smallest margins. That is, if
necessary, interchange the columns to make m ≤ n and the rows to make
t ≤ N − t. The simplest binomial approximation is to treat A as binomial
with parameters

n1 = min(m, t),    p1 = max(m, t)/N.    (3.3)

This amounts to treating the largest margin as belonging to an infinite


sample. It matches the actual mean and range of A, but gives a variance too
large by omission ofthe finite population correction factor (N - nl)/(N - 1).
A second approximation [Sandiford, 1960] matches the mean and ap-
proximately the variance, but not the upper limit. The procedure is to treat
A as binomial with parameters

n2 = an integer near mt(N − 1)/[N(N − 1) − n(N − t)],    p2 = mt/(n2 N).    (3.4)

An integer value of n2 is used solely to facilitate entering binomial tables.


Ord [1968] gives a correction factor that improves this approximation.
The probability of A or less may also be approximated by the correspond-
ing binomial probability with parameters

n3 = min(m, t) = n1,    p3 = [max(m, t) − A/2]/[N − (n3 − 1)/2].    (3.5)
This is not equivalent to treating A as binomial, since p3 depends on A.
For this reason, it is not easy to see what mean and variance this approxima-
tion assigns to A, but it obviously gives the correct range. This is the first
approximation given by Wise [1954], and is actually an upper bound on the
probability when A = O. It is based on approximating the sum of hyper-
geometric probabilities by the incomplete beta function plus a correction
factor.
To apply these three approximations to the data in Table 3.2, we first rearrange
it in the form of Table 3.5.

Table 3.5
              X's     Y's
≥ 18.0          5       2       7
< 18.0          5      10      15
               10      12      22

We then find from (3.3), (3.4) and (3.5):

p1 = 10/22 = 0.4546,    P-value = 0.159;

n2 = 5,    p2 = 10(7)/[5(22)] = 0.6363,    P-value = 0.104;

p3 = (10 − 2.5)/(22 − 3) = 0.3947,    P-value = 0.091.

As mentioned earlier, the exact P-value is 0.113. Here the second approxima-
tion is better than the third. In other cases, the third approximation may be
better than the second, and both are almost always better than the first.
These and other approximations based on the binomial, Poisson, normal
and other distributions are discussed more fully in Lieberman and Owen
[1961] and Johnson and Kotz [1969]. Peizer, extending Peizer and Pratt
[1968] and Pratt [1968], developed an excellent normal approximation that
is easily calculated. It has been refined and studied by Ling and Pratt [1981]
and is given at the end of Table E. See also Molenaar [1970].
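The three binomial approximations are also easy to evaluate by machine. The Python sketch below (illustrative code) computes the parameters in (3.3), (3.4), and (3.5) for the rearranged Table 3.5 and sums the corresponding binomial upper tail for the observed A = 5, reproducing the approximate P-values 0.159, 0.104, and 0.091 found above.

```python
from math import comb

def binom_upper_tail(a_obs, n, p):
    """P(A >= a_obs) for A binomial with parameters n and p."""
    return sum(comb(n, a) * p**a * (1 - p)**(n - a) for a in range(a_obs, n + 1))

m, n, t, N, A = 10, 12, 7, 22, 5          # Table 3.5 (rows already rearranged)

n1, p1 = min(m, t), max(m, t) / N                          # (3.3)
n2 = round(m * t * (N - 1) / (N * (N - 1) - n * (N - t)))  # (3.4)
p2 = m * t / (n2 * N)
n3, p3 = n1, (max(m, t) - A / 2) / (N - (n1 - 1) / 2)      # (3.5)

for ni, pi in [(n1, p1), (n2, p2), (n3, p3)]:
    print(round(binom_upper_tail(A, ni, pi), 3))   # 0.159, 0.104, 0.091
```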

3.3 Ties

Ties present no problem in the two-sample sign test with fixed ξ because we
defined "above" as meaning greater than or equal to and "below" as strictly
below. Ties are also easily handled for a quantile test, but a brief discussion is
needed here. Consider the median test, for example; then we intend to choose
t = N/2 or (N - 1)/2. However, if ties occur at the median of the com-
bined sample, then dichotomizing at the median will ordinarily lead to some
other value of t, a smaller value when the observations equal to the median are
counted as "above." The value of t could be preserved by breaking these ties
at random, along the lines of Sect. 6 of Chap. 3. A more appealing procedure
would be to dichotomize at a point slightly above or slightly below the
median, whichever value makes t closer to N /2. This is equivalent to keeping
the sample median as the dichotomizing point and redefining the terms
"above" and "below" in order to make t as close as possible to N /2. In

other words, the observations at the median are assigned to that category,
"above" or "below," which contains fewer other observations. We are free
to do this since t may be chosen as any function of the combined sample
without changing the null distribution given in (3.1), as remarked earlier.
It may sometimes be preferable, especially if the observations are changes
and the median occurs at "no change," to omit the observations at the
median. Then "below" means strictly below and .. above" means strictly
above, and the sample sizes are the numbers of observations different from
the median. The hypergeometric distribution continues to apply under the
null hypothesis, however (Problem 9c).
In the special case where the observations are not only not all different,
but also have only two possible values, the data are inherently in the form of
a 2 x 2 table. Then there is no freedom of choice regarding the value of t
and the situation is more like the case of having a fixed ξ. However, if one of
the two possible values is called "below" and the other "above," this could
be considered an extreme case of the situation with ties described earlier.

3.4 Corresponding Confidence Procedures

Suppose that the shift assumption, as stated in Eqs. (2.1), (2.2) or (2.3),
holds so that X and Y − μ have the same distribution, but μ is unknown.
In order to test a null hypothesis which specifies a particular value for μ
using a two-sample sign test procedure, we could subtract μ from each Y_j
and then apply a two-sample test for identical distributions to the observa-
tions X_1, ..., X_m, Y_1 − μ, ..., Y_n − μ. The confidence region corresponding
to such a test consists of those values of μ which would be "accepted" when
so tested. We could proceed by trial and error, testing various values of μ to
other quantile test, there is a very simple way to obtain these confidence limits
explicitly, as we will now see.
Consider a two-sample quantile test at level α specifying t as the marginal
total of the first row. Let a and a′ be the lower and upper critical values of A,
that is, P(A ≤ a) + P(A ≥ a′) ≤ α under the hypergeometric distribution
with parameters m, n, and t. Then we would "accept" μ if there are at least
(a + 1) X's and at most (a′ − 1) X's among the t smallest of X_1, ..., X_m,
Y_1 − μ, ..., Y_n − μ, and reject μ otherwise. This region of "acceptance" is the
interval between two confidence limits which can be very simply stated in
terms of the order statistics of the two samples as follows.
Let X_(1), ..., X_(m) be the X's rearranged in order of increasing (algebraic)
value, so that X_(1) ≤ X_(2) ≤ ... ≤ X_(m). Define Y_(1), ..., Y_(n) similarly. Then
the test procedure "accepts" μ if and only if (Problem 10)

Y_(t−a′+1) − X_(a′) ≤ μ ≤ Y_(t−a) − X_(a+1)    (3.6)

except perhaps at the endpoints, where the procedure has not been defined.
Equation (3.6) then gives the confidence interval for the shift μ which cor-
responds to a two-sample quantile test with first row total t. The confidence
level is 1 − α, where α = P(A ≤ a) + P(A ≥ a′) according to the hyper-
geometric distribution for this m, n, and t. Of course, either the left-hand or
right-hand side of (3.6) may be used separately as a one-sided confidence
bound; then α is P(A ≥ a′) or P(A ≤ a) respectively. If the distributions are
not continuous, the confidence levels are conservative as long as the end
points are included in the confidence regions (Problem 107).
The values of a and a′ for given α, or of α for some selected a and a′, can
be found from Table E for t = N/2 if N is even, or t = (N ± 1)/2 if N is
odd, that is, the values of t corresponding to the median test. It may be
desirable to use other values of t, especially since the choice of α for any one
t is very limited. However, this requires a more extensive table, since com-
putation of α must of course be based on the value of t actually used.
Suppose that m = n = 10 and we use t = N/2 = 10. Then Table E shows
that choosing a = 2, a′ = 8 gives α = 0.0115 + 0.0115 = 0.0230. Thus at
level 1 − α = 0.9770, (3.6) gives

Y_(3) − X_(8) ≤ μ ≤ Y_(8) − X_(3)

as the confidence interval for the shift μ which corresponds to the median
test.
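Computationally, the interval (3.6) requires nothing more than the two sets of order statistics. The sketch below (illustrative code; the data are made up, and in practice a and a′ come from the hypergeometric distribution or Table E) returns the bounds Y_(t−a′+1) − X_(a′) and Y_(t−a) − X_(a+1).

```python
def quantile_test_interval(x, y, t, a, a_prime):
    """Confidence interval (3.6) for the shift, from a two-sample quantile test.

    a and a_prime are the lower and upper critical values of A for first-row
    total t; order statistics are indexed from 1 as in the text.
    """
    xs, ys = sorted(x), sorted(y)
    lower = ys[(t - a_prime + 1) - 1] - xs[a_prime - 1]   # Y(t-a'+1) - X(a')
    upper = ys[(t - a) - 1] - xs[(a + 1) - 1]             # Y(t-a)   - X(a+1)
    return lower, upper

# Hypothetical example with m = n = 10, t = 10, a = 2, a' = 8 as in the text,
# so the interval runs from Y(3) - X(8) to Y(8) - X(3).
x = [4.1, 5.6, 2.3, 7.8, 6.0, 3.9, 5.1, 4.8, 6.6, 5.9]
y = [6.2, 7.7, 4.4, 9.9, 8.1, 6.0, 7.2, 6.9, 8.7, 8.0]
print(quantile_test_interval(x, y, t=10, a=2, a_prime=8))
```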
*Mood [1950, pp. 395-398] suggests confidence bounds of the form
Y_(r) − X_(s) without relating them to the median test and its counterpart
with arbitrary t. (See also Mood and Graybill [1963, pp. 412-416], and Mood
et al. [1974, pp. 521-522].) It is not obvious that the formula given there for
α is equivalent to the hypergeometric formula (Problem 11). The confidence
limits corresponding to the control median test, the first median test, and
Rosenbaum's test are all of the form (3.6) except that the upper and lower
limits employ different values of t (Problem 13).*

3.5 Power

In this section we discuss methods of finding the power of two-sample sign


tests against alternatives under which the X's and Y's are independent
random samples from two different populations. Consider first the two-
sample sign test with ξ fixed and selected in advance. Here A and B are
independently binomially distributed with respective parameters p1 =
P(X < ξ) and p2 = P(Y < ξ), that is, the proportions of the X and Y
populations below ξ. Accordingly, the power of the two-sample sign test
with fixed dichotomizing point ξ is simply the power of the test for equality
of two proportions against the alternative p1 and p2. This is not a particularly
nonparametric problem and will not be discussed further here.

Consider now the median test, or, more generally, any quantile test with
the value of t fixed in advance. Suppose that the populations are continuous,
so that we may ignore the possibility of ties and hence of not being able to
attain the chosen t. Then, by (3.6), the two-tailed test rejects the null hypo-
thesis of equal populations if and only if

Y_(t−a′+1) − X_(a′) > 0    or    Y_(t−a) − X_(a+1) < 0,    (3.7)


where a and a' are respectively the lower and upper critical values of A. The
power of this test is then the probability of (3.7). Each inequality in (3.7) by
itself gives a one-tailed test whose power is the probability of that inequality.
Consider the second inequality. The order statistics X_(a+1) and Y_(t−a) are
independent, and for any two specific populations, their densities can be
simply expressed algebraically in terms of the population distributions
(Problem 30, Chap. 2). The probability P(Y_(t−a) − X_(a+1) < 0), and hence
the power of a lower-tailed test, can therefore be obtained from these densities
by a two-variable integration (Problem 14a). The power of an upper-tailed
test can be obtained similarly. The sum of these probabilities gives the power
of a two-tailed test since the two inequalities are mutually exclusive (Problem
14b). Hence the power of any quantile test with t fixed in advance is easily
found using the expression (3.7) of the rejection region in terms of order
statistics. Calculation of the power directly from the original definition of
the test in terms of a 2 x 2 table appears much more difficult. The exact
power of the median test in small samples was investigated in Gibbons
[1964c] for various alternatives.
The approximate normality of the order statistics leads to an approxima-
tion for the probability of (3.7). Specifically, if m and n are large and t/N
is not near 0 or 1, then a/m and (t − a)/n will not be near 0 or 1 provided
α is not too small (Problem 15). If X has a density which is continuous and
nonzero at the quantile of order (a + 1)/m for the X distribution, then it
follows that X_(a+1) is approximately normal with some mean μ(X, a + 1)
and variance σ²(X, a + 1) (see Problem 31, Chap. 2). Similarly Y_(t−a) is
approximately normal with mean μ(Y, t − a) and variance σ²(Y, t − a), say.
Since X_(a+1) and Y_(t−a) are independent, their difference is also approximately
normal, and P(Y_(t−a) − X_(a+1) < 0) can be approximated by the cumulative
standard normal distribution Φ(z), with

z = [μ(X, a + 1) − μ(Y, t − a)] / [σ²(X, a + 1) + σ²(Y, t − a)]^{1/2}.    (3.8)

One approximation for the mean and variance is given by (Problem 16)

μ(X, a + 1) = the quantile of order p of the X distribution,    (3.9)

σ²(X, a + 1) = [p(1 − p)/m]{f[μ(X, a + 1)]}^{−2},    (3.10)

where p = (a + 1)/m and f is the density of X; μ(Y, t − a) and σ²(Y, t − a)
are similarly defined.
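The approximation (3.8)-(3.10) is straightforward to program. The sketch below (illustrative code, written for normal populations with unit variance and a shift δ, an assumption made only for this example) returns the approximate probability of the second inequality in (3.7), that is, the approximate power of the lower-tailed test.

```python
from statistics import NormalDist

def approx_power_lower_tail(m, n, t, a, delta):
    """Approximate P(Y(t-a) - X(a+1) < 0) via (3.8)-(3.10) when X ~ N(0,1)
    and Y ~ N(delta,1); this approximates the power of the lower-tailed test."""
    std = NormalDist()                       # X distribution
    shifted = NormalDist(mu=delta)           # Y distribution

    p_x = (a + 1) / m
    mu_x = std.inv_cdf(p_x)                              # (3.9) for X
    var_x = p_x * (1 - p_x) / m / std.pdf(mu_x) ** 2     # (3.10) for X

    p_y = (t - a) / n
    mu_y = shifted.inv_cdf(p_y)                          # (3.9) for Y
    var_y = p_y * (1 - p_y) / n / shifted.pdf(mu_y) ** 2 # (3.10) for Y

    z = (mu_x - mu_y) / (var_x + var_y) ** 0.5           # (3.8)
    return std.cdf(z)

print(approx_power_lower_tail(m=50, n=50, t=50, a=18, delta=-0.5))
```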

3.6 Consistency

The criterion for consistency in one-sample (or paired-sample) tests was


given in Sect. 3.4 of Chap. 3. A test based on two samples is called con-
sistent against a particular alternative if its power against that alternative
approaches 1 as the sample sizes m and n approach infinity. Of course, the
test must be defined for all sample sizes m and n. Thus consistency in the
two-sample case is, strictly speaking, a property of a double sequence of
tests, one test for each pair (m, n). Sometimes it is assumed that m and n are
related in such a way that m/n approaches a positive, finite limit as m and n
both approach infinity. We do not assume this, however, and permit m and n
to vary independently.
The median test is consistent against any alternative such that X and Y
have different medians. More precisely, for each m and n, consider a two-
tailed median test, and suppose that the level in each tail is at least ε for
every m and n, where ε is some positive constant which does not depend on
m and n (that is, the level in each tail is bounded away from 0 as m and n
approach ∞). Then the probability of rejection approaches 1 as m and n
approach infinity if the X and Y populations have different medians. If the
medians are not both unique, this must be interpreted as meaning that no
median of X is also a median of Y. Figure 3.1 shows two sets of hypothetical
c.d.f.'s where the medians are not both unique. In 3.1(a), the X and Y medians
have no common value even though the c.d.f.'s are quite similar. In Fig.
3.1(b), the c.d.f.'s are quite disparate but they have a point in common which
is a median of both populations. A necessary and sufficient condition for
X and Y to have no common median (Problem 17), which applies whether
the medians are unique or not, is that there exists a ξ such that either

P(X ≤ ξ) < 0.5 < P(ξ ≤ Y)    or    P(Y ≤ ξ) < 0.5 < P(ξ ≤ X).    (3.11)

In the first case, ξ is smaller than any median of X but larger than any median
of Y, while in the second case, ξ is larger than any median of X and smaller
than any median of Y, as would be possible in Fig. 3.1(a).
Similarly, the one-tailed median test is consistent against alternatives
with medians which differ in the appropriate direction.

Figure 3.1

*These facts can be proved (Problem 18a) using the fact that the left and
right sides of (3.6) are both consistent estimators of the difference of the
medians. Alternatively, (3.11) can be used along with the consistency of the
two-sample sign test with fixed ξ (Problem 18b).*

3.7 "Optimum" Properties

As developed earlier, the median test is to a considerable extent a two-


sample analogue of the one-sample sign test. However, the optimum pro-
perties of the sign test (Sect. 3.2, Chap. 2) do not carry over to the median
test, or to the other two-sample quantile tests (that is, when t is chosen in
advance). A proof like that in Sect. 3.3 of Chap. 2 would require specifying
families of distributions for which the entries in the 2 x 2 table with t fixed
are sufficient statistics, and this seems impossible to do.
The usual test for 2 x 2 tables does have optimum properties like those of
the ordinary binomial test (Sects. 7 and 8, Chap. 1). While these properties
are usually expressed in contexts where not all margins are fixed (equality
of proportions and double dichotomy, in the terminology of Sect. 3.2), they
can also be stated for all margins fixed, as in the case of the median test.
However, this gives optimality only among tests based solely on the entries
in the 2 x 2 table. Since the question of particular significance for non-
parametric statistics is whether a test based on some other statistic might be
better, we will not pursue the exact sense in which the median test is optimum
among tests based on the entries in the 2 x 2 table.
The two-sample sign test with fixed ξ does have optimum properties
analogous to those of the one-sample sign test. Specifically, consider Table
3.1 with "below" meaning "below ξ," where ξ is chosen in advance. Among
tests of the null hypothesis that the probability below ξ is the same in both
populations, Fisher's exact test applied to 2 x 2 tables like Table 3.1 has the
following properties: (a) a one-tailed test is uniformly most powerful un-
biased against the alternative that the probability below ξ is larger in the
X than in the Y population, or smaller, whichever is appropriate; (b) any
two-tailed test is admissible; and (c) a two-tailed, (conditionally) unbiased
test is uniformly most powerful among unbiased tests against the alternative
that the probability below ξ is not the same in both populations (Problem
19a). The restriction that the X's be identically distributed and the Y's be
identically distributed can also be relaxed as in Sect. 3.2 of Chap. 2 (Problem
19b).
Thus the two-sample sign test with fixed ξ has some optimality properties
that are truly nonparametric in nature, while the quantile tests apparently
do not. This may seem to suggest that the sign test with fixed ξ is generally
to be preferred to the median test. This is the case when one is interested in
a single value of ξ, known in advance. Such situations are rare, however,

and the advantages of optimality for fixed ξ are ordinarily outweighed by the
advantages of the median test in being able to select the point of dichotomiza-
tion sensibly in light of the combined sample.

3.8 Weakening the Assumptions

We have been assuming that all the observations are independent and
identically distributed under the null hypothesis. The name and construc-
tion of the median test might suggest that its level would be retained as long
as all observations are drawn from distributions with the same median.
However, this is unfortunately not the case. If the X's are drawn from one
population and the Y's from another population, where the medians are the
same but the scale parameters differ, the level may be seriously affected even
in large samples (see Pratt [1964] for further discussion and numerical,
asymptotic results). The same point applies, a fortiori, to the corresponding
confidence procedure. The two-sample sign test with fixed ξ is, of course,
valid whenever the probability below ξ is the same for the two populations,
but it would seldom happen that we know in advance a value of ξ for which
this assumption holds under a null hypothesis that allows different scales in
the two populations.
In a treatment-control comparison where the units are assigned at
random to the two groups, the randomization itself guarantees the level of
the median test for the null hypothesis that the treatment has no effect.
To be more specific, suppose the X's refer to a control group and the Y's
to a treatment group. Given the N units in the experiment, if any set of m
of them is as likely as any other set of m to be the control group, and each
unit would yield the same measurement whether treated or not, then the
probability of rejection by the median test is the usual null probability for
2 x 2 tables obtained from the hypergeometric distribution. The randomiza-
tion does not guarantee that the level is preserved in the corresponding
confidence procedure, however, in the absence of some property such as no
interaction between treatment and units, or the shift assumption. (See
Sect. 2, Problem 21, and for further detail and discussion, Sect. 7 of Chap. 2,
and Sect. 9 of Chap. 8.) The same statements apply to any other quantile
test.
Another kind of weakening is possible for all one-tailed, two-sample
sign tests, whether a quantile test or one with fixed ξ. Suppose the X_i and
Y_j are independent, but not necessarily identically distributed, and consider
any one-tailed, two-sample sign test with rejection region in the upper tail
of A (too many X's are in the "below" category). This test rejects with prob-
ability at most α (the exact level when all observations are independently,
identically distributed) if

P(X_i < z) ≤ P(Y_j < z)    for all z, i, and j,    (3.12)

and with probability at least α if

P(X_i < z) ≥ P(Y_j < z)    for all z, i, and j.    (3.13)

(Compare Eqs. (3.21) and (3.22), Chap. 3.) Under (3.12), any X_i is less likely
to be to the left of any specified point than any Y_j is, so that the distribution
of every X_i is "to the right" of ("stochastically larger" than) the distribution
of every Y_j. See Fig. 2.1(a) for a graphic illustration of this relationship.
Similarly (3.13) means that all the X's are "stochastically smaller" than all
the Y's. Since the probability of rejection is at most α when (3.12) holds, the
null hypothesis could be broadened to include (3.12) without affecting the
significance level. Similarly it is natural to broaden the alternative to include
(3.13); since the probability of rejection is at least α when (3.13) holds, the
test is by definition unbiased against (3.13).
On the other hand, against certain alternatives under which the X's are
drawn from one population and the Y's from another population with a
different median, the power of the median test is less than its level. The test
is then biased against this alternative; this is true even for the one-tailed
test in the indicated direction (Problem 22).
*The statements of the next-to-last paragraph are consequences (Problem
23) of the following fact, which is of interest in itself. Suppose a test φ is
"increasing" in the Y direction in the sense that, if φ rejects for X_1, ..., X_m,
Y_1, ..., Y_n and any X_i is decreased or Y_j increased, φ still rejects. (The one-
tailed two-sample sign tests rejecting if A is too large have this property, by
Problem 23.) Then the probability that φ will reject increases (not necessarily
strictly) when the distribution of X_i is moved to the left or the distribution of
Y_j is moved to the right, that is, when the c.d.f. F_i of X_i is replaced by F_i*
where F_i*(x) ≥ F_i(x) for all x, or when the c.d.f. G_j of Y_j is replaced by G_j*
where G_j*(y) ≤ G_j(y) for all y. Formally, for randomized tests φ, we have the
following theorem.

Theorem 3.1. If X_1, ..., X_m, Y_1, ..., Y_n are independent with c.d.f.'s F_i, G_j and
φ(X_1, ..., X_m, Y_1, ..., Y_n) is a randomized test function which is decreasing
in each X_i and increasing in each Y_j, then the probability of rejection

E_{F,G}[φ(X_1, ..., X_m, Y_1, ..., Y_n)]

is an increasing function of the F_i and a decreasing function of the G_j. Spe-
cifically,

E_{F*,G*}[φ] ≥ E_{F,G}[φ]

if F_i*(x) ≥ F_i(x) for all x and i and G_j*(y) ≤ G_j(y) for all y and j.

The proof is similar to that of Theorem 3.3 of Chap. 3 and is left as


Problem 42. *

4 Procedures Based on Sums of Ranks


In this section we consider again the situation of two mutually independent
samples and the null hypothesis that the two populations sampled are
identical. The median test of the previous section makes use of only the
number of observations from each sample which are above or below the com-
bined sample median. A test which takes into account more of the available
relevant information might be expected to have greater power ordinarily.
A simple but important and useful test of this type is the rank sum test,
which will be discussed in this section. This test is frequently called the
Wilcoxon test or Mann-Whitney test (or sometimes even the Mann-
Whitney-Wilcoxon test). Kruskal [1957] has traced the history of rank sum
tests as far back as Deuchler [1914]. However, the publications initiating the
tests in modern terms are Wilcoxon [1945], Mann and Whitney [1947], and
also Festinger [1946].
The properties of the rank sum will be developed here under the assump-
tion that X_1, ..., X_m, Y_1, ..., Y_n are independent observations drawn from
two populations and for the null hypothesis that the populations are equiva-
lent. The shift assumption is not necessary here and usually will not be made,
although there is reason to think that the rank sum test has especially good
performance for location alternatives. If we do make the shift assumption,
the null hypothesis can be stated as H_0: μ = 0 and further, the rank sum
test procedure can easily be modified to provide a test of H_0: μ = μ_0. The
corresponding confidence procedures for the shift parameter are explained
in Sect. 4.3. In Sect. 4.6, we will see that the assumptions can be weakened in
certain ways without affecting the level of the test or confidence procedure.
We will usually assume that both populations are continuous so that we
need not consider the possibility of ties either across or within samples. The
problem of ties is considered explicitly in Sect. 4.7.

4.1 The Rank Sum Test Procedure

To carry out the rank sum test procedure, we first combine the m X's and
nY's into a single group of m + n = N observations, which are all different
because of the continuity assumption. We then arrange the pooled observa-
tions in order of magnitude, but keep track of which observations are from
which sample. We assign the ranks 1,2, ... , N to the combined ordered
observations, with 1 for the smallest and N for the largest.
The data shown in Table 4.1 have been ranked by this procedure. (Often
in practice the first row is omitted and the values which are from say the X
sample are underlined or similarly indicated.)
The rank sum can be defined as the sum of the ranks in either sample;
we use Rx to denote the sum of the ranks of the X observations and Ry for

Table 4.1a
Sample   Y     X     X     Y     Y     X     X     Y     X     Y
Value    1.25  1.75  3.25  4.25  5.25  6.25  6.75  7.25  9.00  10.00
Rank     1     2     3     4     5     6     7     8     9     10

a These data are from United States Senate [1953], Hearings Before the Select
Committee on Small Business, Eighty-third Congress, First Session on Investiga-
tion of Battery Additive AD-X2 (March 31, June 22-26). The X's and Y's refer to
untreated batteries and batteries treated with AD-X2 respectively. The values
given here were obtained by averaging the ranks given on performance of the
batteries by two representatives of the manufacturer. The assumptions of the
beginning of this section are not satisfied, but the batteries for treatment were
selected randomly from the 10 batteries, and this also validates the test, as dis-
cussed in Section 4.6.

the sum of the ranks of the Y observations. Since Rx + Ry is the sum of all
the ranks, 1 + 2 + ... + N = N(N + 1)/2, we have

Rx + Ry = N(N + 1)/2 (4.1)

and the tests based on Rx and Ry are therefore equivalent. In Table 4.1 we
have

Rx = 2 + 3 + 6 + 7 + 9 = 27,

Ry = 1 + 4 + 5 + 8 + 10 = 28,

and

Rx + Ry = 55

which is in agreement with (4.1). Either statistic Rx or Ry is commonly called


the Wilcoxon rank sum statistic.
The rank sum test can also be based on another equivalent statistic,
usually called the Mann-Whitney statistic and denoted by U. U can be
defined as the number of (X, Y) pairs for which X > Y, or informally as
the number of X's greater than Y's (or Y's smaller than X's). Equivalently,
U is the sum over i = 1, ..., m of the number of Y's which are smaller than
X_i, or the sum over j = 1, ..., n of the number of X's larger than Y_j. In
Table 4.1 there are 5 X's larger than the smallest Y, 3 X's larger than the
second smallest Y, 3 larger than the third Y, 1 larger than the fourth Y, and
0 larger than the largest Y, giving a sum U = 5 + 3 + 3 + 1 + 0 = 12.
If we reverse the roles of X and Y, we can define an alternative Mann-
Whitney statistic U′ as the number of X's which are smaller than Y's (or Y's
larger than X's). Then U + U′ is the total number of (X, Y) pairs, or

U + U′ = mn.    (4.2)



The foregoing definitions lead fairly directly to a linear relation between


the Mann-Whitney and Wilcoxon statistics; specifically (Problem 29),
U = Rx − m(m + 1)/2 = mn + n(n + 1)/2 − Ry,    (4.3)
U′ = Ry − n(n + 1)/2 = mn + m(m + 1)/2 − Rx.    (4.4)
In some cases, it may be easier to find U or U′ by calculating Rx or Ry and
then using (4.3) or (4.4), rather than to find them directly.
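For the data of Table 4.1, the statistics and the relations (4.1)-(4.4) can be verified directly, as in the short Python sketch below (illustrative code).

```python
x = [1.75, 3.25, 6.25, 6.75, 9.00]    # X values from Table 4.1
y = [1.25, 4.25, 5.25, 7.25, 10.00]   # Y values from Table 4.1
m, n, N = len(x), len(y), len(x) + len(y)

pooled = sorted(x + y)
rank = {v: i + 1 for i, v in enumerate(pooled)}    # ranks 1, ..., N (no ties)
rx = sum(rank[v] for v in x)                       # 27
ry = sum(rank[v] for v in y)                       # 28
u = sum(1 for xi in x for yj in y if xi > yj)      # Mann-Whitney U = 12
u_prime = m * n - u                                # (4.2)

assert rx + ry == N * (N + 1) // 2                 # (4.1)
assert u == rx - m * (m + 1) // 2                  # (4.3)
assert u_prime == ry - n * (n + 1) // 2            # (4.4)
print(rx, ry, u, u_prime)                          # 27 28 12 13
```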
Under the null hypothesis of identical populations, the ranks of the
X's are equally likely to be any set of m from the integers 1, 2, ... , N. In
other words, the X ranks constitute a random sample of size m drawn with-
out replacement from the set {1, 2, ..., N}. If, on the other hand, the Y
population tends to have larger values than the X population, we would
expect the X ranks to be smaller than under the null hypothesis. For one-
sided alternatives in this direction, the appropriate regions for rejection are
then small values of Rx or U, and large values of Ry or U′. The corresponding
test against the alternative that the X population tends to have larger values
than the Y population rejects for large values of Rx or U, and small values of
Ry or U′. An equal-tailed test against the two-sided alternative of a difference
in either direction rejects at level 2α if either of the foregoing one-tailed
tests rejects at level α.
To test the null hypothesis that the Y population is the same as the X
population except for a shift by the specified amount μ_0, the procedure above
is followed without change except that μ_0 is subtracted from each Y before
the combined samples are arranged in order of size. That is, the rank sum
test is applied in exactly the same way as above, but to X_1, ..., X_m, Y_1 − μ_0,
..., Y_n − μ_0 instead of X_1, ..., X_m, Y_1, ..., Y_n. The corresponding con-
fidence interval procedures will be discussed in Sect. 4.3.
For investigation of theoretical properties of the rank sum test, two repre-
sentations of the test statistics are often convenient. For the Mann-Whitney
statistic, it is natural to write
U = Σ_{i=1}^{m} Σ_{j=1}^{n} U_ij    (4.5a)

where

U_ij = 1 if Y_j < X_i,  and  U_ij = 0 if Y_j > X_i.    (4.5b)

The Wilcoxon rank sum statistic is represented naturally as

Rx = Σ_{k=1}^{N} k I_k    (4.6a)

where

I_k = 1 if the observation with rank k is an X,  and  I_k = 0 if the observation with rank k is a Y.    (4.6b)

We can, of course, use whichever representation is more convenient for the


purpose at hand, since the statistics are linearly related.

4.2 Null Distribution of the Rank Sum Statistics

The exact null distribution of any of these rank sum statistics is based on the
fact stated before that, under the null hypothesis of identical distributions,
the X ranks constitute a random sample of size m drawn without replacement
from the first N integers, where N = m + n. Equivalently, all arrangements
of the m X's and n Y's in order of size are equally likely. With (N choose m) possible
arrangements, each one occurs with probability 1/(N choose m). This fact determines
the null distribution of Rx, and by (4.1), (4.3) and (4.4), also that of Ry, U
and U'.
The direct method of generating the null distribution of say Rx is to list
all possible arrangements, calculate the value of Rx for each, and tally the
results. Then

P(Rx = t) = v(t)/(N choose m)

where v(t) is the number of arrangements for which the sum of the X ranks
equals t. For tabulation it is more efficient to use an easily developed recur-
sive technique (Problem 30). Fix and Hodges [1955] present a more sophisti-
cated approach, tabulating related quantities more compactly than is possible
for the distribution itself (Problem 32).
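For small m and n the direct method is trivial by machine. The sketch below (illustrative code) enumerates all (N choose m) equally likely sets of X ranks, tallies Rx, and, as an example, evaluates the lower-tail probability of the value Rx = 27 observed in Table 4.1.

```python
from itertools import combinations
from collections import Counter
from math import comb

def rank_sum_null_distribution(m, n):
    """Exact null distribution of Rx: all C(N, m) rank sets are equally likely."""
    N = m + n
    counts = Counter(sum(ranks) for ranks in combinations(range(1, N + 1), m))
    total = comb(N, m)
    return {rx: v / total for rx, v in sorted(counts.items())}

dist = rank_sum_null_distribution(5, 5)
# Lower-tail probability of the observed Rx = 27 from Table 4.1:
print(sum(p for rx, p in dist.items() if rx <= 27))
```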
The mean and variance of these rank sum statistics under the null hypo-
thesis are most easily evaluated by using the fact that Rx is the sum of m
observations drawn without replacement from the finite population consist-
ing of {1, ..., N}. The mean and variance of this population (Problem 33)
are (N + 1)/2 and (N² − 1)/12. The mean and variance (calculated using
the finite-population correction factor) of the sample sum are therefore

E(Rx) = m(N + 1)/2    (4.7)

var(Rx) = [m(N − m)/(N − 1)][(N² − 1)/12] = mn(N + 1)/12.    (4.8)


The means and variances of Ry, U and U′ under the null hypothesis can then
be found from the relationships given in (4.1), (4.3) and (4.4). The results
(Problem 34) are

E(Ry) = n(N + 1)/2,    var(Ry) = var(Rx) = mn(N + 1)/12,    (4.9)

E(U) = E(U′) = mn/2,    var(U) = var(U′) = mn(N + 1)/12.    (4.10)


U and U′ are identically distributed (Problem 35), and their possible
values are all the integers between 0 and mn inclusive. The possible values of
Rx are all the integers from m(m + 1)/2 to m(2N - m + 1)/2, and of Ry the

integers from n(n + 1)/2 to n(2N - n + 1)/2 (Problem 36). The distributions
of U, U′, Rx and Ry are all symmetric about their respective means for any
m and n (Problem 37).
Since all the rank sum statistics are equivalent, a table of the null dis-
tribution is needed for only one of these statistics. Table F at the back of the
book gives the cumulative tail probabilities of Rx for m ≤ n ≤ 10. Only the
smaller tail probability is given; each entry is both a lower tail probability
for Rx ≤ m(N + 1)/2 and a symmetrically equivalent upper tail probability
for Rx ≥ m(N + 1)/2. In order to use this table, the sample with fewer
observations should be labeled the X sample. More extensive tables are
published in Harter and Owen [1970].
For m and n large, Rx, Ry, U and U′ are all approximately normally
distributed under the null hypothesis [Mann and Whitney, 1947], with the
means and variances given above. Small tail probabilities are generally
overestimated by the normal approximation. For sample sizes both smaller
than 20, for example, it is better to omit the continuity correction of 1/2 in
such a way as to reduce the tail probability when the standardized normal
variable is greater than 2 or, for comparison with critical values from normal
tables, when the one-sided significance level is smaller than 0.025. See Jacob-
son [1963] for more detail.
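As a small illustration of this normal approximation (our sketch, not part of the text; the function name is hypothetical), the standardized value and approximate lower-tail probability can be computed as follows.

```python
import math

def rank_sum_normal_p_value(rx, m, n, continuity=True):
    """Approximate lower-tail P-value for Rx using the null mean (4.7) and
    variance (4.8); the 1/2 continuity correction can be omitted, as the text
    suggests, when the standardized value is large."""
    N = m + n
    mean = m * (N + 1) / 2
    var = m * n * (N + 1) / 12
    z = (rx + (0.5 if continuity else 0.0) - mean) / math.sqrt(var)
    return 0.5 * math.erfc(-z / math.sqrt(2))          # standard normal c.d.f. at z

# Example: m = 4, n = 10, observed Rx = 19 (the exact probability is 0.071).
print(rank_sum_normal_p_value(19, 4, 10))
```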

4.3 Corresponding Confidence Procedures

We now turn to the problem of setting confidence bounds on μ when X1,
..., Xm and Y1, ..., Yn are independent random samples from two con-
tinuous populations and the Y population is the same as the X population
except for a shift by the amount μ. The confidence region for μ which cor-
responds to the rank sum test consists of all values of μ which would be
"accepted" if the rank sum test procedure were applied to X1, ..., Xm,
Y1 − μ, ..., Yn − μ. That region could be found by trial and error, testing
various values of μ to see which ones lead to "acceptance." A more systematic
approach is possible, however, leading to a simpler method, as will now be
described.
For the observations X1, ..., Xm, Y1 − μ, ..., Yn − μ, the statistic U is
the number of pairs (Xi, Yj − μ) for which Xi > Yj − μ, or, equivalently,
Yj − Xi < μ. Accordingly, the rank sum test with rejection region U ≤ k
applied for a particular hypothesized value of μ would reject or accept
according as μ is smaller or larger than the (k + 1)th of the mn differences
Yj − Xi when these differences are arranged from smallest to largest. Thus,
the (k + 1)th difference from the smallest is the lower confidence bound for
μ corresponding to this test, with confidence level 1 − α, where α is equal to
the null probability that U ≤ k. Similarly, the (k + 1)th difference from the
largest is an upper confidence bound for μ at level 1 − α. A table of the null
distribution of U is not given in this book, but the above procedure is easily
carried out using Rx. If c is that number from Table F such that P(Rx ≤ c) =
α, then by (4.3) we have

k = c - m(m + 1)/2. (4.11)

This is equivalent to saying that k is the number of possible values of Rx


below the critical value at level α, that is, k + 1 is the rank of the lower tail
critical value at level α of Rx among all possible values of Rx (Problem 39).
There is a convenient graphical procedure for finding the (k + 1)th
from the smallest and/or largest of the differences Yj − Xi and hence for
determining the confidence bounds. The method is illustrated in Fig. 4.1
for the data given in Table 4.1. Each of the N observations is plotted on a
rectangular coordinate system, the X observations on the abscissa and Y
observations on the ordinate.

Figure 4.1

A horizontal line is drawn through each Yj,


and a vertical line through each Xi. Each of the mn intersections represents a
pair of observations (Xi, Yj). Note that an intersection with an axis does not
count unless this axis is the line through an observation equal to 0. The 45°
line Y − X = μ, for any number μ as Y intercept, divides the (X, Y) pairs
into two groups. Those on the left and above have Y − X > μ, and those
on the right and below have Y − X < μ. Thus the upper confidence bound
on μ is the Y-intercept of the 45° line which has k intersections to the upper
left of it and passes through the (k + 1)th intersection. If greater accuracy
is desired, the difference Yj − Xi can be computed for that (Xi, Yj) pair
which gives the (k + 1)th intersection. In close cases it may be necessary
to compute more than one Yj − Xi to determine exactly which is the
(k + 1)th.
We illustrate these calculations using the example of Table 4.1 even
though the shift assumption is unreasonable in the situation producing these
data. We have m = n = 5 and from Table F, P(Rx ≤ 20) = 0.041, so that
c = 20 for α = 0.041. From (4.11), k = 20 − 5(6)/2 = 5. The line in Fig. 4.1
which has Y intercept 3.50 has five intersections above and to the left of
it and passes through the sixth intersection (1.75, 5.25), so that 3.50 (= 5.25 −
1.75) is an upper confidence bound for μ with confidence coefficient 0.959
(= 1 − 0.041). Similarly, −2.50 is a lower bound at this same level, and
−2.50 < μ < 3.50 is a confidence interval with level 1 − 2(0.041) = 0.918.
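The same bounds can be computed directly from the mn differences; the following sketch (ours, with hypothetical data and an illustrative value of k) returns the (k + 1)th smallest and (k + 1)th largest differences.

```python
import numpy as np

def shift_confidence_bounds(x, y, k):
    """Lower and upper confidence bounds for the shift corresponding to the
    rank sum test with rejection region U <= k: the (k + 1)th smallest and the
    (k + 1)th largest of the mn differences Yj - Xi (Sect. 4.3)."""
    diffs = np.sort(np.subtract.outer(np.asarray(y), np.asarray(x)).ravel())
    return diffs[k], diffs[len(diffs) - 1 - k]

# Illustrative data only; k would be obtained from Table F via k = c - m(m + 1)/2.
x = [1.2, 2.7, 3.1, 4.4, 6.0]
y = [0.9, 3.5, 4.8, 5.2, 7.1]
print(shift_confidence_bounds(x, y, k=5))
```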

4.4 Approximate Power

The confidence procedure just developed, corresponding to the rank sum


test procedure, required the shift assumption, but that assumption was not
made to develop the test or its null distribution theory. This subsection and
the next concern further properties of the test procedure, namely its power
and consistency, under general alternatives which need not be shifts. We
assume only that the X and Y observations are mutually independent, each
set having the same continuous distribution.
While the exact power of the rank sum tests can be evaluated for specific
alternatives (see, for example, Gibbons [1964c]), the present discussion is
limited to asymptotic power. In Sect. 4.2, we stated that the null distributions
of U, U', Rx and Ry are asymptotically normal. Since these statistics are also
asymptotically normal under most alternative hypotheses, the power of the
rank sum test against a specified alternative can be approximated by apply-
ing Table A to the appropriate standardized variable once the mean and
variance under that alternative are evaluated. (The asymptotic normality
and an appropriate standardization follow from results proved by Chernoff
and Savage [1958] and Hoeffding [1948] for more general classes of statistics
based on two sets of identically distributed observations. For the rank sum
statistic, Capon [1961a] gives a generalization to non identically distributed
sets of observations, along with general expressions for the mean and
variance of the test statistic.)
In order to approximate the power of the rank sum test against a specified
alternative distribution, it will be convenient to introduce the probabilities:

p1 = P(Xi > Yj) (4.12)

p2 = P(Xi > Yj and Xk > Yj) (4.13)

p3 = P(Xi > Yj and Xi > Yl) (4.14)

for all i, j, k, l with i ≠ k and j ≠ l. Hence p1 is the probability that an X
variable exceeds a Y variable; p2 is the probability that two different X
variables both exceed a single Y variable; and p3 is the probability that an
X variable exceeds both of two different Y variables. Integral expressions for
these three probabilities are given in Problem 41.
For any X and Y distributions, the moments of U needed for standardiza-
tion can be expressed in terms of these probabilities as

E(U) = mnp1 (4.15)

var(U) = mn[(m − 1)(p2 − p1²) + (n − 1)(p3 − p1²) + p1(1 − p1)]. (4.16)

We will now prove these results using the expression for U given in (4.5a),
where U is the sum of mn indicator variables Uij defined by (4.5b) (Uij = 1
if Xi > Yj). These Uij are Bernoulli random variables, identically distributed
although not all independent. In terms of the probability p1, their mean is

E(Uij) = P(Xi > Yj) = p1 for all i, j (4.17)

so that

E(U) = Σ_{i=1}^{m} Σ_{j=1}^{n} E(Uij) = mnp1.

In terms of the probabilities p1, p2 and p3, the second-order moments of the
Uij are

var(Uij) = p1(1 − p1) (4.18)

cov(Uij, Ukj) = P(Xi > Yj and Xk > Yj) − p1² = p2 − p1² (4.19)

cov(Uij, Uil) = P(Xi > Yj and Xi > Yl) − p1² = p3 − p1² (4.20)

cov(Uij, Ukl) = 0 (4.21)

for all i, j, k, l with i ≠ k and j ≠ l. We now express the variance of U in
terms of the moments of the Uij (Problem 43) as

var(U) = mn[var(Uij) + (m − 1) cov(Uij, Ukj) + (n − 1) cov(Uij, Uil)
         + (m − 1)(n − 1) cov(Uij, Ukl)] (4.22)

for i ≠ k, j ≠ l. Substituting (4.18)-(4.21) into (4.22) gives (4.16) immediately.
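To illustrate how (4.15) and (4.16) can be used to approximate power (a sketch of ours, not the book's; the function name and the Monte Carlo estimation of p1, p2, p3 are our own choices), consider the one-sided test that rejects when U is too small.

```python
import math
import numpy as np

def approximate_power(m, n, z_alpha, draw_x, draw_y, reps=200_000, rng=None):
    """Normal approximation to the power of the one-sided rank sum test that
    rejects when U is too small; z_alpha is the upper-alpha standard normal
    point (e.g. 1.645 for alpha = 0.05).  p1, p2, p3 of (4.12)-(4.14) are
    estimated by Monte Carlo; the mean and variance of U under the
    alternative then come from (4.15) and (4.16)."""
    rng = rng or np.random.default_rng(0)
    x1, x2 = draw_x(reps, rng), draw_x(reps, rng)
    y1, y2 = draw_y(reps, rng), draw_y(reps, rng)
    p1 = np.mean(x1 > y1)
    p2 = np.mean((x1 > y1) & (x2 > y1))        # two X's exceed the same Y
    p3 = np.mean((x1 > y1) & (x1 > y2))        # one X exceeds two different Y's
    mean_u = m * n * p1
    var_u = m * n * ((m - 1) * (p2 - p1**2) + (n - 1) * (p3 - p1**2) + p1 * (1 - p1))
    N = m + n
    crit = m * n / 2 - z_alpha * math.sqrt(m * n * (N + 1) / 12)   # null cutoff for U
    z = (crit - mean_u) / math.sqrt(var_u)
    return 0.5 * math.erfc(-z / math.sqrt(2))                      # approx. P(U <= crit)

# Example: Y is a standard normal X shifted up by 1, m = n = 10, alpha about 0.05.
power = approximate_power(10, 10, 1.645,
                          lambda k, g: g.standard_normal(k),
                          lambda k, g: g.standard_normal(k) + 1.0)
print(round(power, 3))
```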

4.5 Consistency

A two-sample test is called consistent against a particular alternative if


the power of the test against that alternative approaches 1 as the sample
sizes approach infinity. (For more detail, see Sect. 3.6.) The rank sum test is
consistent against any alternative under which the X's and Y's are inde-
pendent observations from two continuous populations with PI of t, where
PI is the probability that an X exceeds a Y, as at Equation (4.12). More
precisely, for each m and n, consider a two-tailed rank sum test with level at
least e in each tail, where e is a positive constant. Then the probability of
rejection approaches 1 as m and n approach infinity if the X and Y popula-
tions are such that PI of t. Similarly, a one-tailed rank sum test is consistent
against alternatives with PI different from t in the appropriate direction.
These consistency properties are easily proved (Problem 46) using the fact
that U/mn is a consistent estimator of PI' as will be shown in Sect. 4.8.
It should be noted that there are situations in which the median test is
consistent but the rank sum test is not, and vice versa. For example, if the
X and Y distributions have different medians but PI = t, then the rank sum
test is not consistent while the median test is. Similarly, jf X and Y have
equal medians but PI of t, then the rank sum test is consistent but the
median test is not (Problem 47). If the X and Y popUlations differ and one
stochastically dominates the other, then PI of 1- and hence the rank sum test
is consistent (Problem 51).

4.6 Weakening the Assumptions

The null distribution of the rank sum test statistic is derived under the
assumption that the X ranks are equally likely to be any set of m out of the
integers 1,2, ... , N. This assumption is satisfied if the X's and Y's are
drawn from the same population. If the populations differ in any way, how-
ever, the level of the test is ordinarily affected. In particular, the assumption
that p1 = ½, where p1 = P(X > Y), is not sufficient to guarantee the level.
Even for populations which are symmetric about the same point and have
the same shape, the level may be seriously affected if their variances differ.
(See Pratt [1964] and also Problem 52.) The same observation applies, a
fortiori, to the corresponding confidence procedure.
The assumptions can be weakened, however, in the same way as was


possible for the median test (Sect. 3.8). In a treatment-control comparison,
random assignment of the units to the two groups itself guarantees the level
of the rank sum test for the null hypothesis that the treatment has no effect
(but not the level of the corresponding confidence bounds for a shift without
something like a shift assumption). In the ordinary case where the X's and
Y's are all mutually independent, if they are not necessarily identically dis-
tributed, then the one-tailed test rejecting when U is too small (too few
X's greater than Y's) rejects with probability at most α if the distribution of
every Xi is "stochastically larger" than the distribution of every Yj, as
defined in Equation (3.12), and hence the null hypothesis could be broadened
to include this possibility. The same test rejects with probability at least α
if the distribution of every Xi is "stochastically smaller" than the distribution
of every Yj, as in Equation (3.13), so the test is "unbiased" against this
alternative. (The proof, requested in Problem 54, is like that for the median
test.) However, the test is not unbiased against the alternative that the X's
are drawn from one population and the Y's from another and p1 < ½, as
its power is less than α for some alternatives satisfying this condition (Problem
55).

4.7 Ties

Two or more observations which are equal in value are called tied. The
ranks of tied observations have not yet been defined, and hence the rank sum
test cannot be applied in the presence of ties without some further specifica-
tion. We have avoided this difficulty so far by assuming continuous dis-
tributions and hence zero probability of a tie. In practice, we must have a
method of dealing with ties because of discontinuous distributions or un-
refined measurements. The discussion here will parallel but abridge the
corresponding discussion of zeros and ties in Sect. 6 of Chap. 3; in particular,
"zeros" have no counterpart here.
The confidence procedures (given in Sect. 4.3) for the amount of a shift
depend only on the differences Yj − Xi. Even when some of these differences
are tied, we can determine the (k + 1)th from the smallest difference and this is
still a lower confidence bound L for the shift μ with k defined as before. How-
ever, the exact confidence level now depends on whether L is included in the
confidence interval or not. More precisely,

P(L ≤ μ) ≥ 1 − α ≥ P(L < μ), (4.23)

where 1 − α is the exact confidence level in the continuous case (Problem
56; see also Problems 62 and 107). A corresponding statement holds for an
upper confidence bound, and for two-sided confidence limits. Thus the
confidence procedures of Sect. 4.3 can still be used, but now it makes a
theoretical difference whether or not the endpoints are included in the stated
interval. Ordinarily this is of no practical consequence and the issue need not
be resolved.
Since the confidence procedures are still applicable, they could be used
to test the null hypothesis that the amount of the shift is μ = 0, which is
equivalent to the hypothesis of identical populations. If 0 is not an endpoint
of the confidence interval, the corresponding test rejects or "accepts" the
null hypothesis according as 0 is outside or inside the confidence interval.
By Problem 57, this is equivalent to rejecting ("accepting") if the ordinary
rank sum test rejects ("accepts") no matter how the ties are broken. If 0
is an endpoint of the confidence interval, it may be sufficient to state this
fact and not actually carry the test procedure further. Another possibility
is to be "conservative" and "accept" the null hypothesis in all borderline
cases; this amounts to breaking the ties in the direction of "acceptance"
and corresponds to including the endpoint in the confidence interval state-
ment. When many ties are likely, however, both these possibilities may reduce
the power considerably.
Two other basic methods of handling ties are the average rank method
and breaking the ties, which we now discuss. Examples will be given shortly.
The average rank (or midrank) method assigns to each member of a
group of tied observations the simple average of the ranks they would have
if they were not tied. The rank sum statistic is then computed as before, but
its null distribution is not the same as for observations without ties. The
exact distribution conditional on the ties can be enumerated, or a normal
approximation can be used (see below). The average rank procedure is
equivalent to defining the Mann-Whitney U statistic as the number of
(Xi, Yj) pairs for which Xi > Yj plus one-half of the number for which
Xi = Yj, because U and Rx continue to be related by Equation (4.3) when
U is defined in this way and Rx is computed from the average ranks (Problem
58).
Methods which break the ties assign distinct integer ranks to the tied
observations. If the ties are broken randomly the usual null distribution of
the rank sum statistic is preserved. Another possibility already mentioned
is to break the ties in favor of acceptance.

Test procedures. To illustrate tests based on these methods of handling ties,


consider the following samples, each arranged in order of magnitude.
X sample (m = 4) 0, 1, 2, 3
Y sample (n = 10) 1, 1, 2, 2, 3, 3, 3, 4, 4, 5 (4.24)
Suppose first that we are using the average rank method. There are, for
example, three 1's tied at positions 2, 3, and 4. Accordingly, each of these 1's
is given the average rank of 3, and similarly for the other sets of tied ranks.
The average rank results are given in Table 4.2.
Under the null hypothesis that the X and Y populations are the same,
the distribution of the X rank sum Rx obtained by the average rank method

Table 4.2
Observation    0  1  1  1  2  2  2  3    3    3    3    4     4     5
Sample         X  X  Y  Y  X  Y  Y  X    Y    Y    Y    Y     Y     Y
Average Rank   1  3  3  3  6  6  6  9.5  9.5  9.5  9.5  12.5  12.5  14

can again be determined from the fact that each possible set of m ranks is
equally likely to be that belonging to the X observations, but this distribution
is now conditional on the positions of the ties in the combined sample, or
equivalently, on the average ranks present. For the data in (4.24), we have
m = 4 and n = 10 so that there are (N choose m) = (14 choose 4) = 1001 ways to select a set of
m = 4 ranks out of the N = 14 ranks. Table 4.2 shows that there are only 6
different average ranks present, in a pattern consisting of one 1, three 3's,
three 6's, four 9.5's, two 12.5's, and one 14. Of the 1001 possible selections of
m = 4 average ranks given this pattern, one selection (three 3's and one 1)
gives Rx = 10, nine selections give Rx = 13, three give Rx = 15, etc. A
portion of the lower tail of the distribution of Rx, given this pattern of ties,
is shown in Table 4.3. Note that the distribution is very uneven and lumpy,
as is frequently the case when many ties occur.
For the data in (4.24), the X rank sum can be found from Table 4.2 as
Rx = 19.5. From the distribution in Table 4.3, we see that under the null
hypothesis, given the ties observed, the exact probability of an X rank sum
as small as or smaller than that observed is P(Rx ≤ 19.5) = 84/1001 = 0.084.
Enumeration of the exact distribution of Rx based on average ranks in the
presence of ties can be lengthy, but it is not difficult to carry out by computer.
In favorable circumstances, the distribution of Rx given in Table F could be
used, but it applies exactly only when no ties are present; in general, it
should not be used when the average rank method is applied in examples with
many ties. For the data in (4.24), Table F gives the P-values P(Rx ≤ 19) =
0.071 and P(Rx ≤ 20) = 0.094; these results, although close to the true P-
value 0.084 found in the paragraph above, are not correct, and in other
examples the discrepancy may be greater. Another possibility is to use the
normal approximation once the relevant mean and variance conditional
on the ties observed are obtained (Problem 59). This procedure may also
be very inaccurate in the presence of many ties because the exact distribution
is generally lumpy, as noted for Table 4.3. Simulation could also be used.
Lehman [1961] performed an interesting but limited comparison between
the exact and approximate distributions with ties for the case m = n = 5.

Table 4.3
r               10  13  15  16  16.5  18  18.5  19  19.5  21
1001 P(Rx = r)   1   9   3   9    12   9     4   1    36   3
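As an illustrative sketch (not from the text; the function names are ours and the brute-force enumeration is practical only for small m), midranks and the conditional null distribution given the ties can be computed as follows.

```python
from itertools import combinations
from collections import Counter
from fractions import Fraction

def midranks(values):
    """Average (mid-) ranks for a combined sample, tied values sharing the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + 1 + j + 1) / 2            # mean of the positions i+1, ..., j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def conditional_null_tail(x, y, cutoff):
    """P(Rx <= cutoff) given the ties, treating every choice of m of the N
    midranks as equally likely (the conditional null distribution of Sect. 4.7)."""
    r = midranks(list(x) + list(y))
    m = len(x)
    sums = Counter(sum(c) for c in combinations(r, m))
    favorable = sum(v for s, v in sums.items() if s <= cutoff + 1e-9)
    return Fraction(favorable, sum(sums.values()))

x = [0, 1, 2, 3]
y = [1, 1, 2, 2, 3, 3, 3, 4, 4, 5]
rx = sum(midranks(x + y)[:len(x)])            # X rank sum with average ranks: 19.5
print(rx, conditional_null_tail(x, y, 19))    # P(Rx <= 19 | ties) = 48/1001, about 0.048
```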
We next consider applying some tiebreaking methods to the data in


(4.24). The three observations which are ones could be assigned the ranks
2, 3, 4 in any order, so that any one of 2, 3, or 4 could be an X rank and the
other two Y ranks, and similarly for the remaining groups of ties. Thus there
are 3 × 3 × 4 = 36 different possible ways of breaking the ties.
If each group of ties is broken separately by some random procedure, the
method is called breaking the ties at random. This procedure, though un-
appealing, preserves the usual null distribution of the rank sum statistic;
hence the P-value or a critical value can be found from Table F.
The two extreme methods of breaking the ties giving (i) the smallest, and
(ii) the largest, rank sum are shown in Table 4.4 for the data in (4.24); they
give Rx = 16 and Rx = 23 respectively. As a result, we know that any method
of breaking the ties gives 16 ≤ Rx ≤ 23, and the corresponding range of
P-values from Table F is 0.027 ≤ P ≤ 0.187. In particular, breaking the
ties at random would give some P-value in this interval.
The "conservative" procedure is to break all ties in favor of acceptance.
In this example, where we are rejecting when Rx is small, the "conservative"
value is Rx = 23 and the corresponding "conservative" P-value from Table
F is P(Rx ≤ 23) = 0.187.
The other extreme would be to break all ties in favor of rejection and
therefore obtain Rx = 16 and a corresponding P-value of 0.027 from Table F.
At the 0.05 level then, or any other level between 0.027 and 0.187, the rank
sum test would reject by one of the two extreme resolutions of ties but not
by the other. Thus looking at the range of possibilities under tiebreaking
would not provide an unambiguous decision for these data.
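The two extreme resolutions of the ties are easy to compute directly; the following sketch (ours, with a hypothetical function name) reproduces the values 16 and 23 for the data in (4.24).

```python
def extreme_tiebreak_rank_sums(x, y):
    """Smallest and largest X rank sums obtainable by breaking ties: within
    each group of tied values, give the X's the smallest ranks (minimum Rx)
    or the largest ranks (maximum Rx)."""
    labeled = [(v, 0) for v in x] + [(v, 1) for v in y]      # label 0 marks an X
    def rank_sum(x_first):
        # sort by value, breaking ties with the X's first or last
        order = sorted(labeled, key=lambda p: (p[0], p[1] if x_first else -p[1]))
        return sum(r + 1 for r, (v, lab) in enumerate(order) if lab == 0)
    return rank_sum(True), rank_sum(False)

x = [0, 1, 2, 3]
y = [1, 1, 2, 2, 3, 3, 3, 4, 4, 5]
print(extreme_tiebreak_rank_sums(x, y))                      # (16, 23)
```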
In some situations, of course, the two extreme cases of tiebreaking may
lead to the same decision. One might therefore hope to have the best of both
worlds by using tiebreaking when it is unambiguous and average ranks when
tiebreaking is indeterminate. This is not a valid shortcut, however, because
the level of the average rank test is calculated on the assumption that average
ranks will always be used, and average ranks can lead to opposite decisions
from tiebreaking even if tiebreaking is unambiguous. The best-of-both-
worlds test amounts to looking at more than "ancillary" data before deciding
on a test procedure, and hence its level would be very difficult to determine.
To illustrate the difficulty, consider the data in Table 4.5 where the ties
are in the same positions as for the data in (4.24).

Table 4.4
Observation  0  1  1  1  2  2  2  3   3  3   3   4   4   5
Sample       X  X  Y  Y  X  Y  Y  X   Y  Y   Y   Y   Y   Y
Ranks (i)    1  2  3  4  5  6  7  8   9  10  11  12  13  14
Ranks (ii)   1  4  2  3  7  5  6  11  8  9   10  12  13  14
Table 4.5
Observation  0  1  1  1  2  2  2  3  3  3  3  4  4  5
Sample       X  Y  Y  Y  X  X  X  Y  Y  Y  Y  Y  Y  Y

Notice in Table 4.5 that all the ties are within samples. In such a case,
any method of breaking the ties gives the same X ranks, namely 1, 5, 6, 7,
and the same rank sum Rx = 19. If we ignore the ties and use Table F, the
probability of an even smaller rank sum is P(Rx < 19) = 0.053, and of one
as small or smaller is P(Rx ≤ 19) = 0.071. At the one-sided level 0.05, the
populations would not be judged significantly different when the ties are
broken, no matter how they are broken; tiebreaking leads to no ambiguity.
On the other hand, if the average rank method is used on the data in
Table 4.5, the rank sum is again Rx = 19, but the null distribution of the
average rank test statistic given the ties observed is that in Table 4.3, not
Table F. Table 4.3 gives P(Rx ≤ 19) = 0.048, so that the average rank test
judges this sample as significant at the 0.05 level, the opposite conclusion
from tie breaking.
The null distribution in Table 4.3 is correct only if the average rank test
is used on all samples with ties in the same positions. If the average rank
method would be used in some cases, such as the data in (4.24), and if the
null distribution would be calculated conditional on the pattern of ties
observed, then the average rank method must be used in all cases, including
those where the tie breaking is unambiguous, such as the data in Table 4.5.
This example shows that, unfortunately, trying to obtain the best of both
worlds affects the level of the average rank procedure.

Choice of Procedure. Which test procedure should we use? The following


requirements (analogous to those stated in Sect. 6.5 of Chap. 3 for the one-
sample case) seem intuitively desirable.
(i) A sample which is significant in the direction X < Y shall not become
insignificant nor an insignificant sample significant in the direction X > Y
when (a) some Y's are increased or X's decreased, or (b) all Y's are increased
or all X's decreased by an equal amount.
(ii) Those values of an assumed shift μ that would be "accepted" if tested
shall form an interval. (This says that the confidence region corresponding
to the test shall be an interval, and is equivalent to (i)(b) (Problem 60).)
(iii) A sample shall be judged significant in either direction if it is sig-
nificant in that direction no matter how the ties are broken; similarly for
not significant.
The average rank test procedure satisfies (i) (b) and (ii) but not (i) (a)
and (iii) (Problem 61). This procedure presumably gives better power, at
least in any ordinary situation, than breaking the ties either conservatively
or randomly. The regular tables do not apply, however, and the null dis-
tribution must be generated for the set of average ranks observed.

The "conservative" procedure, that is, breaking the ties and choosing
the value of the test statistic that is least favorable to rejection, satisfies all of
these requirements. However, the true significance level is unknown and
may sometimes be much smaller than the nominal level, resulting in a con-
siderable reduction in power over the average rank procedure.
Breaking the ties at random permits use of the ordinary tables and
satisfies all of the requirements above (Problem 62). However, the introduc-
tion of extraneous randomness in an artificial way is objectionable in itself,
and presumably reduces the power.
The confidence bounds for an assumed shift μ corresponding to any
method of breaking ties are those obtained in Sect. 4.3. Whether or not the
confidence bounds are included in the confidence interval depends on how
the ties are broken. The confidence regions corresponding to the average
rank procedure may be different, although they are also intervals (Problem
63).

4.8 Point and Confidence Interval Estimation of P(X > Y)

There are situations in which one is interested in the probability that a ran-
domly selected member of the X population will exceed an independent,
randomly selected member of the Y population. This probability is the
parameter p1 = P(X > Y) defined earlier by (4.12). Suppose, for example,
that X is the strength of a manufactured item and Y is the maximum stress
to which it will be subjected when installed in an assembly [Birnbaum,
1956]. If X > Y, the component will not fail in use. In such a case, p1 is a
parameter of clear economic importance. It might also be of interest in non-
economic contexts. In a comparison of two populations, it is frequently
desirable to say something about how much they differ, in addition to, or
instead of, performing a test of the hypothesis that they are the same. The
difference between the population means or medians, and the amount
of the shift μ if the shift assumption is made, are defined only if the difference
between two items can be measured on some numerical scale. A point
estimate or confidence interval for these quantities has meaning only to the
extent that the scale has meaning. However, in the absence of such a meaning-
ful scale, as long as the items can be ranked, p1 is still meaningful. Accordingly,
we will now discuss point and interval estimation of p1, but again under the
assumption that ties occur with probability zero.
Since p1 is the probability that an X exceeds a Y, a natural estimator is
the proportion of (Xi, Yj) pairs for which Xi > Yj, that is, U/mn. This estima-
tor has expected value p1 by (4.15), and the variance can be found from (4.16).
Hence U/mn is unbiased for p1, and it is consistent (Problem 65), that is,
for every ε > 0,

P(|U/mn − p1| > ε) → 0  as m and n → ∞. (4.24)



If the class of possible distributions is broad enough, for instance, if the


X's and Y's may be drawn from any two continuous distributions, then no
other unbiased estimator has variance as small (Problem 66). In a sufficiently
restricted class of distributions, unbiased estimators with smaller variance
may exist, but, as in Sect. 3.2 of Chap. 2, the greater the apparent advantages
of such estimators, the greater risk accompanies their use. If the two popula-
tions are normal with means μx, μy and common variance σ², for example, the
value of p1 is p1' = Φ[(μx − μy)/(σ√2)]. A natural estimator of p1', and hence
of p1, is p̂1 = Φ[(X̄ − Ȳ)/(s√2)], where X̄ and Ȳ are the sample means, s is
the usual estimate of the standard deviation, and Φ is the standard normal
c.d.f. (This estimator is not quite unbiased for p1' under normality, but could
be made so (Problem 67) by an adjustment similar to that given by Ellison
[1964] for the one-sample case.) Under the normality assumption, and in
many other circumstances, p̂1 has smaller variance than U/mn, in fact,
substantially smaller when p1 is near 0 or 1 (Problem 68). However, its
values tend to cluster around p1', which equals p1 for normal distributions
but not in general. A slight departure from normality can easily lead to an
important difference between p1 and p1'. (Typical goodness-of-fit tests of
normality shed almost no light on this particular question.) In fact, such
information as the observations provide for estimating p1 beyond the value
of U/mn is relatively little and difficult to extract. All this is especially true
when the advantage of the estimator p̂1 over U/mn would be greatest if the
assumption of normality were literally correct, namely when p1 is near 0 or 1.
Of course, one might focus interest on the parameter p1' or even on (μx − μy)/
(σ√2) instead of on p1. However, if p1 is really the relevant parameter, this
does not solve the real problem and might encourage misunderstandings.
When p1 is near ½, U/mn may have smaller variance than p̂1 and is not likely
to have much larger variance, so there is little to be gained from using some
other estimator (see also Sects. 3 and 10, Chap. 8).
Based on the normal approximation to the distribution of U, we can
obtain an approximate confidence interval for p1 from the inequality

|U − mnp1| ≤ z √var(U) (4.25)

where z is an appropriate standard normal deviate. Var(U) could be estima-
ted by (4.16) with p2 and p3 replaced by estimates obtained from the data.
A natural estimator of p2 is the proportion of triples (Xi, Xk, Yj), i ≠ k, for
which Xi > Yj and Xk > Yj, and a natural estimator of p3 is the proportion
of triples (Xi, Yj, Yl), j ≠ l, with Xi > Yj and Xi > Yl.
Alternatively, var(U) might be replaced by an upper bound. One upper
bound is given by the inequality (proof below)

var(U) ≤ mnp1(1 − p1) max(m, n). (4.26)

This bound on the variance can be made sharper if the class of distributions
is restricted. For example, if the X population is stochastically smaller than
the Y, that is, F(t) ≥ G(t) for all t, then the variance satisfies (Birnbaum and
Klose [1957]; Rustagi [1962])

var(U) ≤ mn[(1 − 2p1)^{3/2}(2m − n − 1) + (n − 2m + 1)
            + 3p1(2m − 1) − 3p1²(m + n − 1)]/3 (4.27)

for m ≤ n, and similarly for m > n with m and n interchanged in (4.27). If
var(U) is replaced by either of these upper bounds, the right-hand side of
(4.25) still depends on p1, so that an interval for p1 is not immediately
obtained. The inequalities resulting in (4.25) could be solved for p1 (Problem
69), or the estimate U/mn could be substituted for p1 on the right-hand side
of (4.25) to produce endpoints which do not involve p1.
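The point estimate U/mn and an interval of the form (4.25), with var(U) estimated as just described, might be computed as in the following sketch (ours; the data and names are illustrative only).

```python
import math
from itertools import combinations

def estimate_p1_with_ci(x, y, z=1.96):
    """Point estimate U/mn of p1 = P(X > Y) and an approximate confidence
    interval based on (4.25), with var(U) estimated from (4.16) using the
    natural estimates of p2 and p3 described in the text."""
    m, n = len(x), len(y)
    u = sum(xi > yj for xi in x for yj in y)
    p1 = u / (m * n)
    # proportion of triples (Xi, Xk, Yj), i != k, with both X's above Yj
    p2 = sum((xi > yj) and (xk > yj)
             for xi, xk in combinations(x, 2) for yj in y) / (m * (m - 1) / 2 * n)
    # proportion of triples (Xi, Yj, Yl), j != l, with Xi above both Y's
    p3 = sum((xi > yj) and (xi > yl)
             for yj, yl in combinations(y, 2) for xi in x) / (n * (n - 1) / 2 * m)
    var_u = m * n * ((m - 1) * (p2 - p1**2) + (n - 1) * (p3 - p1**2) + p1 * (1 - p1))
    half_width = z * math.sqrt(max(var_u, 0.0)) / (m * n)
    return p1, (max(0.0, p1 - half_width), min(1.0, p1 + half_width))

# Illustrative data only.
print(estimate_p1_with_ci([0.8, 1.6, 2.3, 3.1], [0.5, 1.2, 1.4, 2.0, 2.9]))
```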

PROOF. The inequality in (4.26) follows from (4.22) and the inequalities
below. For i ≠ k, j ≠ l,

cov(Uij, Ukj) + cov(Uij, Uil) = cov(Uij, Ukj) + cov(1 − Uij, 1 − Uil)
  = P(Xi > Yj and Xk > Yj) + P(Xi < Yj and Xi < Yl) − p1² − (1 − p1)²
  ≤ 1 − P(Xk < Yj and Xi > Yl) − p1² − (1 − p1)²
  = 1 − (1 − p1)p1 − p1² − (1 − p1)² = p1(1 − p1) (4.28)

cov(Uij, Ukj) ≤ [var(Uij) var(Ukj)]^{1/2} = p1(1 − p1) (4.29)

cov(Uij, Uil) ≤ [var(Uij) var(Uil)]^{1/2} = p1(1 − p1). (4.30)

In obtaining (4.28) we used the fact that the three events (Xi > Yj and
Xk > Yj), (Xi < Yj and Xi < Yl) and (Xk < Yj and Xi > Yl) are mutually
exclusive, and hence the sum of their probabilities is at most 1. For m ≤ n,
say, we write Equation (4.22) as

var(U) = mn{var(Uij) + (m − 1)[cov(Uij, Ukj) + cov(Uij, Uil)]
            + (n − m) cov(Uij, Uil) + (m − 1)(n − 1) cov(Uij, Ukl)}

and substitute (4.18), (4.21), (4.28) and (4.30) to obtain the desired result,
(4.26). □

5 Procedures Based on Sums of Scores


Suppose, as at the beginning of Sect. 4, that X1, ..., Xm and Y1, ..., Yn are
mutually independent random samples from continuous populations (so
that "ties" need not be considered). In order to test the null hypothesis of
identical populations, the rank sum test could be generalized to use arbi-
trary constants which are not necessarily ranks. Once the combined sample
observations are arranged in order of magnitude, any set of numbers c1,
c2, ..., cN (positive or negative), which we call scores, can be assigned to
the observations. (The constant ck is associated with Xi if Xi has rank k in the
combined sample.) Then the sum of scores of observations from, say, the
X sample provides a test statistic much like Rx.
For the data of Table 4.1, for any constants c1, ..., cN we have the follow-
ing assignment of scores.

Sample   Y     X     X     Y     Y     Y     X     Y     X     Y
Value    1.25  1.75  3.25  4.25  5.25  6.25  6.75  7.25  9.00  10.00
Rank     1     2     3     4     5     6     7     8     9     10

The null distribution of the sum of scores test statistic can be determined
by enumeration in a manner analogous to that described for Rx in Sect. 4.2,
since under the null hypothesis the X scores constitute a random sample
of m scores drawn without replacement from the N available. A table of
P-values could then be constructed for any particular set of constants ck.
Alternatively, a normal approximation could be used, by standardizing with
the mean and variance given in Problem 77a.
For a test of the null hypothesis that the Y population is the same as the
X population except for a shift by the amount μ, the foregoing test is applied
to X1, ..., Xm, Y1 − μ, ..., Yn − μ. The set of all values of μ which would
be "accepted" when so tested forms a confidence region for the amount of
the shift, under the model of the shift assumption. The confidence region will
be an interval if the ck form a monotone sequence (ck+1 − ck has the same
sign for all k), and each confidence bound will be one of the mn differences
Yj − Xi (see Sect. 6 and Problem 76).
The general sum of scores statistic can be written in a form analogous to
(4.6) as

Σ_{k=1}^{N} ck Ik (5.1a)

where

Ik = 1 if the observation with rank k is an X,
   = 0 if the observation with rank k is a Y. (5.1b)

Similar statistics analogous to Ry are also easily defined. For any particular
set of scores ck, the sum of scores statistics for the two samples are again
linearly related and hence equivalent as test statistics.
Many different two-sample rank tests are of this general type, including
all the ones we have studied so far in this chapter. If ck = k for k = 1, ..., N,
the sum of scores test is simply the rank sum test. If ck = 1 for k ≤ N/2 and
ck = 0 otherwise, it is the median test. If ck = 1 for k ≤ t and ck = 0 otherwise,
it is a two-sample quantile test with the observations divided into "below"
and "above" at the (t)th order statistic of the combined sample. The two-
sample sign test for fixed ξ, however, is not a sum of scores test, because t
would have to be chosen such that t observations are less than ξ and hence
the scores are not fixed in advance; as remarked earlier, it is not a rank test
either.
Fisher and Yates [1963 and earlier editions starting in 1938] suggested
using as ck the expected value of the kth from the smallest in a random sample
of N from the standard normal distribution; this choice arises naturally in
Sect. 8. Tables of these ck, called normal scores, are given by Fisher and Yates
[1963] to two decimal places, and with more precision by Teichroew [1956]
and Harter [1961]. Terry [1952] gives tables of the null distribution of the
resulting test statistic and discusses approximations. Even the most straight-
forward normal approximation involves the sum Σ_1^N ck² (Problem 77), which
is tabulated by Fisher and Yates [1963] for N ≤ 50. This sum can be found
from the individual ck² values given in Teichroew [1956] for N ≤ 20. The
small sample power of this test (and other two-sample sum of scores tests)
was investigated in Gibbons [1964c] for various alternatives. An optimality
property of this test and the rank sum test will be discussed in Sect. 8.
Van der Waerden [1952, 1953] suggested setting ck equal to the quantile
of order k/(N + 1) for the standard normal distribution; this test is nearly
the same as the Fisher-Yates-Terry test without requiring special tables of
ck. The constants here are especially conveniently obtained from tables of
rational quantiles of the normal distribution, for instance, Fisher and Yates
[1963] and van der Waerden and Nievergelt [1956]. The latter reference
also gives the null distribution of the test statistic, and van der Waerden
[1956] discusses approximations to this distribution.
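For reference (our sketch, not the book's; it assumes SciPy is available and the function names are hypothetical), the Fisher-Yates normal scores and the van der Waerden scores can be computed as follows.

```python
import math
from scipy.stats import norm
from scipy.integrate import quad

def normal_score(k, N):
    """Fisher-Yates normal score: expected value of the kth smallest in a
    standard normal sample of size N, by integrating the order-statistic density."""
    coef = k * math.comb(N, k)    # equals N! / ((k-1)! (N-k)!)
    integrand = lambda z: z * norm.cdf(z) ** (k - 1) * norm.sf(z) ** (N - k) * norm.pdf(z)
    val, _ = quad(integrand, -10, 10)
    return coef * val

def van_der_waerden_score(k, N):
    """Van der Waerden score: the k/(N + 1) quantile of the standard normal."""
    return norm.ppf(k / (N + 1))

N = 10
fy = [normal_score(k, N) for k in range(1, N + 1)]
vdw = [van_der_waerden_score(k, N) for k in range(1, N + 1)]
# With either set of scores, the test statistic is the sum of the scores of the X observations.
print([round(c, 2) for c in fy])
print([round(c, 2) for c in vdw])
```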
The two foregoing normal scores tests actually differ little from the
Wilcoxon rank sum test at typical levels, at least in small samples. They do
offer a much wider choice of natural significance levels however. That is,
in the tail of the null distribution, the rank sum test statistic treats as alike
possible sample outcomes which are distinguished by the normal scores
test statistic. The comments of the last two paragraphs of Sect. 7.1 of Chap. 3
apply here also, with appropriate modification.
In order to emphasize the fact that the various sums of scores tests do
distinguish between arrangements of the X and Y variables in different, but
related ways, the extreme small values of these test statistics
and the corresponding cumulative (left-tail) frequencies are shown in Table
5.1 for the sample sizes m = 6, n = 6. These frequencies are divided by
(12 choose 6) = 924 to obtain probabilities. The fact that the rank sum statistic
differentiates rank orders in a much less refined way is obvious from this
listing. The Fisher-Yates-Terry (T2) and van der Waerden (T3) test statistics
give almost identical orderings for that part of the tail which is listed (49/924
or 5.3 %); the orderings are identical for the first 4.2 % (39/924) of the tail
Table 5.1 Cumulative Left-Tail Frequencies Σf for Rank Sum Statistic
Rx, Fisher-Yates-Terry Statistic T2 and van der Waerden Statistic T3
Rx    Σf    -T2    Σf    -T3    Σf
21 1 4.49 1 4.07 1
22 2 4.28 2 3.88 2
23 4 4.07 4 3.68 4
24 7 3.86 5 3.49 5
25 12 3.85 7 3.48 7
26 19 3.66 8 3.29 8
27 30 3.64 10 3.28 10
28 43 3.59 12 3.24 12
29 61 3.44 14 3.09 14
30 83 3.42 15 3.07 15
31 111 3.38 17 3.05 17
3.27 19 2.96 19
3.23 21 2.89 21
3.21 22 2.88 22
3.18 24 2.85 24
3.16 26 2.84 26
3.06 28 2.76 28
3.00 30 2.68 30
2.97 32 2.66 32
2.95 34 2.64 34
2.90 35 2.60 35
2.86 37 2.57 37
2.84 39 2.55ᵃ 41
2.79 40 2.48 42
2.76 42 2.45 48
2.74 48 2.41 49
2.70 49

ᵃ The four values listed for 2.55 are actually only two pairs of true ties. To four
decimal places, two values are equal to 2.5524 and two equal 2.5522.

and between 4.7 % (43/924) and 5.3 % inclusive. Furthermore, all three
statistics are almost monotonic functions of one another, although frequently
one stays constant while another changes. In other words, in the portion of
the tail listed for all three statistics (a little over 5 %), the possible sets of X
ranks can be put in an order such that they enter the critical region in almost
this order for all three tests, the only difference being how many enter at one
time. The one exception is where - T2 = 2.79 and - T3 = 2.55. For larger
sample sizes there would be more differences, although those between T2
and T3 are always minor.
6 Two-Sample Rank Tests and the Y - X Differences


Tests depending only on the ranks of the Xi and Yj in the combined sample
may be called two-sample rank tests. They include those based on sums of
scores, but could also be of other forms.
We have seen in detail that the rank sum statistic is the number of positive
differences Yj − Xi, and that the corresponding confidence bound for shift
is an order statistic of these differences. More generally, the confidence
bound corresponding to any two-sample rank test is necessarily one of
these differences, as mentioned in the last section. This follows from the
fact that, as μ varies, the ranks obtained from X1, ..., Xm, Y1 − μ, ...,
Yn − μ change only at the values μ = Yj − Xi.
As the values of the Yj − Xi relate to confidence intervals for shift, so
their signs relate to tests of zero shift (identical populations). The two-
sample rank tests are exactly those tests depending only on the signs of these
differences, as long as order within samples is ignored, which in practice it
would be. In other words, two-sample permutation-invariant tests (see
Sect. 7) are rank tests if and only if they depend only on the signs of the
Y(j) − X(i), the differences of the order statistics of the two samples separately.
We shall not go into further detail about these relationships here, because
the one-sample situation is very similar and it was discussed at length in
Sect. 7 of Chap. 3. We do note, however, that the confidence bound corres-
ponding to a one-tailed two-sample quantile test was identified in Sect. 3.4
as a particular difference Y(j) − X(i), and of course the test itself depends
only on the sign of the same difference. Thus a quantile test is the simplest
case from this point of view, as from others.

7 Invariance and Two-Sample Rank Procedures


The properties of permutation invariance and invariance under other
transformations were defined for one-sample procedures in Sect. 8 of Chap. 3.
We give here analogous definitions of these properties in the two-sample
case, and also some arguments for restricting consideration to these types of
invariant procedures. Since the arguments are also analogous to those dis-
cussed extensively in Sect. 8 of Chap. 3, they will be repeated here only briefly.
They lead to a definition of two-sample, permutation-invariant rank tests
as those having both of these in variance properties.
A procedure φ(X1, ..., Xm, Y1, ..., Yn) will be called permutation in-
variant if it is unchanged by permutations of the Xi and by permutations
of the Yj. In other words, φ is not changed if the order of the X's or the order
of the Y's is changed. In mathematical notation, φ is permutation invariant if

φ(Xi1, ..., Xim, Yj1, ..., Yjn) = φ(X1, ..., Xm, Y1, ..., Yn) (7.1)

for all permutations i1, ..., im of 1, ..., m and j1, ..., jn of 1, ..., n.
A procedure is permutation invariant if and only if it is a function solely


of the order statistics of the two samples separately, that is, of X(1), ..., X(m)
and Y(1), ..., Y(n) alone. Here X(i) denotes the ith from the smallest among the
X's and Y(j) is the jth from the smallest among the Y's (not among the X's and
Y's combined).
In the kind of situation we have been discussing, it is natural to use a
permutation invariant procedure, and all of the procedures discussed in
this chapter have been of this type. There are also two concrete arguments
against using procedures that are not permutation invariant but depend on
the order of the X's or the order of the Y's separately. First, in some models
the order statistics of the two samples form a sufficient statistic so that
nothing can be gained by looking beyond them. Second, even when this
condition fails, there is a direct invariance argument that may apply.
The order statistics of the two samples, X(1), ..., X(m), Y(1), ..., Y(n), form
a sufficient statistic if the Xi are identically distributed and the Yj are iden-
tically distributed and all are independent. (We have often made this assump-
tion. For sufficiency, it must hold under every contemplated distribution.
In testing, it must hold under the alternative as well as the null hypothesis.
Notice that it is not necessary, however, that the X's have the same distribu-
tion as the Y's.) By the properties of sufficiency, it follows that, under the
stated conditions, given any procedure there exists an equivalent, permuta-
tion invariant procedure. This procedure (possibly randomized) is based on
the order statistics alone and has exactly the same operating characteristics
as any given procedure based on the observations and their order within
each sample separately.
If the X's, or the Y's, are not necessarily identically distributed, then the
order statistics of the two samples no longer form a sufficient statistic. In
this case, the foregoing, very strong, justification of permutation invariance
is no longer applicable. However, it may still seem unreasonable to permit
the procedure to depend on the order of the X's or of the Y's separately.
If the observations provide intuitively the same evidence on the matter in
hand however the X's or Y's are rearranged separately, then a "reasonable"
procedure φ would be unaffected by rearrangement of the X's or of the Y's,
that is, would be permutation invariant. The discussion of the "principle
of invariance" in Sect. 8 of Chap. 3 applies here also with appropriate
changes (Problem 88), and hence will not be further elaborated.
Any procedure which is a function of the order statistics in the two samples
separately is permutation invariant. It need not be a function of the ranks.
Rank procedures, however, are the only ones which are also invariant under
another, very large and very different class of transformations, specifically,
those produced by strictly increasing functions. Invariance under these
transformations might also be desirable, and we now consider them.
Suppose for convenience that the X's and the Y's are independent random
samples from two populations and we are testing the null hypothesis that
the populations are the same against the alternative that they are not. In
other words, the X's and Y's are all independent, the X's are identically dis-
tributed, the Y's are identically distributed, the null hypothesis is that the
X's and Y's have the same distribution and the alternative is that they do not.
Let g be any strictly increasing function. If X1, ..., Xm, Y1, ..., Yn satisfy
the null hypothesis, then so also do g(X1), ..., g(Xm), g(Y1), ..., g(Yn), and
the same applies to the alternative hypothesis. Accordingly, we can "in-
voke the principle of invariance" and require that a test treat X1, ..., Yn
in the same way as g(X1), ..., g(Yn). If this is required for all strictly in-
creasing functions g, then any two sets of observations with the same X
ranks and Y ranks must be treated alike, because any set of observations can
be carried into any other set with the same ranks by such a g (Problem 89).
In short, tests based on the ranks of the observations are the only tests which
are invariant under all strictly increasing transformations g.
The same argument applies to other null and alternative hypotheses of
the sort we have been considering, provided only that all strictly increasing
transformations g carry null distributions into null distributions and alter-
natives into alternatives. This holds, for instance, if the earlier alternative
hypothesis is tightened to require that the X's be stochastically larger than
the Y's (Sects. 3.8 and 4.6), or relaxed to permit the X's or Y's or both to be
not necessarily identically distributed (Problem 90). It does not hold under
the shift assumption, however (Problem 90d).
Arguments were given earlier for restricting consideration to permutation
invariant procedures, that is, for excluding procedures which depend on the
order of the X's or of the Y's separately. Applying the argument for permuta-
tion invariance, along with the argument for procedures which are invariant
under transformations by any strictly increasing function, that is, rank tests,
leads to restricting consideration to permutation invariant rank tests. These
tests depend only on the ranks of the X's and y's in the combined sample,
without regard to their order within the separate samples. The null distribu-
tions of their test statistics can be generated by taking as equally likely the
(N choose m) separations of the ranks 1, ..., N into m X ranks and n Y ranks.
These procedures can also be defined in terms of the following indicator
variables:

Ij = 1 if the jth smallest observation among X1, ..., Yn is an X,
   = 0 otherwise,

for j = 1, ..., N. A two-sample test is a permutation invariant rank test
if and only if its critical function φ(X1, ..., Yn) is a function only of I1, ..., IN
(Problem 91).
As discussed for the one-sample case in Sect. 8 of Chap. 3, the argument
for invariance under strictly increasing functions is far less compelling than
the argument for permutation invariance based on sufficiency. Indeed one
might not want to treat X1, ..., Yn and g(X1), ..., g(Yn) alike in some ex-
treme instances. However, if one is content to treat alike practically all sets
of observations with the same ranks, then restricting consideration to two-
sample rank tests is justifiable. In such a case, little is lost by using a rank
test. The choice of which rank test remains, of course.

8 Locally Most Powerful Rank Tests


There are many two-sample rank tests. In this section we consider the ques-
tion of which one has the best power function. As in the one-sample case,
there is no two-sample rank test which is uniformly most powerful against
the classes of alternatives usually considered. We must therefore be content
with more limited objectives. In Sect. 8.1 we will first find that two-sample
rank test which is most powerful against any particular alternative pair of
distributions for X and Y. We will then consider a one-parameter family of
alternatives approaching the null hypothesis of equal distributions and find
the "locally most powerful" two-sample rank test against that family, in
general, and in some specific cases. This test is always based on a sum of
scores. In Sect. 8.2 we will investigate the class of all locally most powerful
rank tests, finding that all sets of scores can arise in this way. We assume
throughout that the observations are independent, are identically distributed
in each sample, and have continuous distributions so that ties within or
across samples have probability zero.

8.1 Most Powerful and Locally Most Powerful Rank Tests Against
Given Alternatives

Let r1, ..., rm, r'1, ..., r'n be the respective ranks corresponding to the
observations X1, ..., Xm, Y1, ..., Yn after they are pooled and arranged
from smallest to largest. Thus ri is the rank of Xi and r'j is the rank of Yj in
the combined sample. If we distinguish different orders of the X's and Y's
within samples, then there are N! possible arrangements of these ranks
(N = m + n). We could argue that, by sufficiency, it is not necessary to
distinguish order within samples, but omit this step because our derivations
will reach this conclusion automatically. We will derive the most powerful
tests among all rank tests, not merely among permutation-invariant rank
tests. We will see that the resulting test is permutation invariant, but proving
this first would not facilitate the derivation.
As usual, consider the null hypothesis of identical X and Y populations.
Under this hypothesis, the N! possible arrangements of the ranks are all
equally likely. By the Neyman-Pearson Lemma (Theorem 7.1 of Chap. 1),
it follows that, among rank tests at level α, the most powerful test against a
simple (completely specified) alternative K rejects if the probability under
K of the observed rank arrangement is greater than a constant k, and "ac-

cepts" if it is less than k, where k and the probability of rejection at k are


chosen to make the level exactly 0: (Problem 92). This may be expressed as

"accept" if PK (I'\> ... , I'm' ,.'1"", ,.;,) < k, (8.1)

where the argument of PK denotes the observed rank arrangement. If two or


more rank arrangements have probability exactly k under the alternative K,
the probabilities of rejection when these arrangements occur need not be the
same; they need only bring the level of the test to exactly 0:.
The most powerful rank test against the simple alternative K depends,
of course, on K. Even if we restrict consideration to normal alternatives with
μ2 > μ1 and common variance σ², the most powerful rank test depends
on the values of the parameters. (This contrasts with Sect. 4.2 of Chap. 6,
where we will see that the most powerful observation-randomization test
depends only on whether μ2 is larger or smaller than μ1.) If we consider the
situation when μ2 is close to μ1, however, we will find there is a "locally
most powerful" rank test against one-sided normal alternatives, namely the
one-tailed Fisher-Yates-Terry test. (This is a sum of scores test where the
score cj is the expected value of the jth from the smallest in a standard normal
sample of size N, as explained in Sect. 5.)
Specifically, any non-randomized, Fisher-Yates-Terry test which is one-
tailed in the appropriate direction is the unique most powerful rank test at
its level against every normal alternative with common variance and with
0 < (μ2 − μ1)/σ < ε, for some sufficiently small, positive ε. The same test
uniquely maximizes the derivative of the power function at the null hypo-
thesis θ = 0 when the power is regarded as a function of θ = (μ2 − μ1)/σ
rather than of the parameters separately. (The same is true of the partial
derivative with respect to μ2, with μ1 and σ fixed, etc., but for any rank test,
the power is a function of θ alone (Problem 93) and it is natural to regard it as
such.) These statements refer only to levels achievable by a nonrandomized,
one-tailed Fisher-Yates-Terry test. At other levels, the derivative of the
power at θ = 0 is maximized among rank tests by a (randomized) Fisher-
Yates-Terry test, though not always uniquely, and maximizing the power in
a neighborhood of θ = 0 may require treating differently those different
rank arrangements for which the sum of scores equals its critical value k
(Problem 93). Similar statements about alternatives in the opposite direction
also hold, of course.
We will also prove corresponding results for the rank sum test when
the alternative distributions are logistic. The logistic distribution with
mean μ and scale parameter σ (and variance σ²π²/3) has cumulative dis-
tribution function and density

F_μσ(x) = 1/[1 + e^{−(x−μ)/σ}],   f_μσ(x) = e^{−(x−μ)/σ}/(σ[1 + e^{−(x−μ)/σ}]²). (8.2)
(See Fig. 9.1, Chap. 3.) Specifically, consider the alternative that the X's
and Y's are drawn independently from logistic distributions with means
μ1 and μ2 and common scale parameter σ. We will show that any non-
randomized, rank sum test which is one-tailed in the appropriate direction
is the unique most powerful rank test at its level against every such alterna-
tive with 0 < (μ2 − μ1)/σ < ε, for some sufficiently small, positive ε. Among
rank tests it also uniquely maximizes the derivative of the power with respect
to θ = (μ2 − μ1)/σ at θ = 0. At other levels, the derivative of the power at
θ = 0 is maximized by a (randomized) rank sum test, though not always
uniquely, and maximizing the power near θ = 0 may require differential
treatment of different borderline rank arrangements (Problem 95).
These results will be derived by a general method so that tests with similar
properties can be derived for any one-parameter, one-sided alternative, and
for any alternative reducible to this form (by a strictly monotonic transforma-
tion of the data that is allowed to depend on nuisance parameters).
Although we have already used the term "locally most powerful," it has
not actually been defined here. One definition, consistent with that given
somewhat informally for the one-sample case in Sect. 9 of Chap. 3, is that a
test is locally most powerful among tests in some designated class against
some designated alternative if it has maximum power among all such tests
at all alternative distributions which are sufficiently close in some specified
sense to the null hypothesis. If we deal with an alternative which can be
indexed by a parameter θ that equals 0 when the null hypothesis is true and
is positive under the alternative, and if we define "close" in the obvious
way, then this definition will require that the test be uniformly most powerful
in some interval (0, ε) with ε > 0. Such a test will also maximize the slope of
the power function at θ = 0. The converse is not always true, however, and
is not automatic even when true; this difficulty is easily overlooked.
PROOFS. Consider a family of distributions Fθ for the X population and a
family Gθ for the Y population, both depending on a one-dimensional
parameter θ. Suppose that θ = 0 satisfies the null hypothesis of identical
distributions, that is, F0 = G0, and consider the alternative θ > 0.
Let r = (r1, ..., rm, r'1, ..., r'n) denote the rank arrangement and Pθ(r)
its probability under the alternative θ. Since all rank arrangements are
equally likely under the null hypothesis, it follows (Problem 96) that a
rank test maximizes the derivative of the power at θ = 0 if and only if it
is of the form

reject if (d/dθ) Pθ(r) > k at θ = 0,
"accept" if (d/dθ) Pθ(r) < k at θ = 0. (8.3)

The value of k and the probability of rejection for r on the boundary given
by k need only be chosen so that the test has level exactly α. More circumspect

behavior at k may be required to maximize power for small 0, however.


Since
1 d
Po(r) = N! + 0 dO Po(r) + terms of smaller order

as 0 --+ 0, it follows from (8.1) that the most powerful rank test against 0
is again of the form (8.3) for sufficiently small 0; however, if two or more
rank arrangements [ lie on the boundary, those with larger values of the
remainder terms must be favored for the rejection region.
At certain levels there is no room for randomization or other choice at
the boundary, and the situation is simple. Specifically, a test of the form

    reject if (d/dθ)P_θ(r) ≥ k at θ = 0
    "accept" otherwise,   (8.4)

among rank tests at its level, uniquely maximizes both the derivative of the
power at θ = 0 and the power against θ for all θ in some interval 0 < θ < ε
(Problem 96). □

We now seek more convenient expressions for the derivative needed in
these tests. If the X and Y populations have densities f_θ and g_θ under the
alternative θ, we may write P_θ(r) as

    P_θ(r) = ∫_R ∏_{i=1}^{m} f_θ(x_i) ∏_{j=1}^{n} g_θ(y_j) dx_1 ⋯ dx_m dy_1 ⋯ dy_n   (8.5)

where R is the region in (X₁, ..., X_m, Y₁, ..., Y_n)-space where the rank ar-
rangement is r. Assume it is legitimate to differentiate (8.5) under the integral
sign, and let

    h_1(x) = ∂/∂θ log f_θ(x)|_{θ=0} = [1/f_0(x)] ∂f_θ(x)/∂θ|_{θ=0},   h_2(y) = ∂/∂θ log g_θ(y)|_{θ=0}.   (8.6)

Then

    (d/dθ)P_θ(r)|_{θ=0} = (1/N!) {Σ_i E_0[h_1(Z_{(r_i)})] + Σ_j E_0[h_2(Z_{(r'_j)})]}   (8.7)

where, on the right-hand side, Z_(1) < ⋯ < Z_(N) are an ordered sample of
N from the distribution with density f₀ = g₀. Therefore the test (8.4), for
example, is equivalent to

    reject if Σ_i E_0[h_1(Z_{(r_i)})] + Σ_j E_0[h_2(Z_{(r'_j)})] ≥ k
    "accept" otherwise   (8.8)

where the constant k may differ from formula to formula. This result may
also be written in the form

    reject if Σ_{j=1}^{N} c_j I_j ≥ k
    "accept" otherwise   (8.9)

where

    I_j = 1 if Z_{(j)} is an X,  I_j = 0 if Z_{(j)} is a Y,

and

    c_j = λ{E_0[h_1(Z_{(j)})] − E_0[h_2(Z_{(j)})]} + γ,   j = 1, ..., N,   (8.10)

for arbitrary constants γ and λ, λ positive. Since the test in (8.9) is equivalent
to that in (8.4), (8.9) also has the property that, among rank tests at its level,
it uniquely maximizes the derivative of the power at θ = 0 and uniquely
maximizes the power against θ for all θ in some interval 0 < θ < ε. Notice
that the test is therefore based on a sum of scores in the sense of Sect. 5.
Similarly, (8.3) is equivalent to

    reject if Σ_{j=1}^{N} c_j I_j > k
    "accept" if Σ_{j=1}^{N} c_j I_j < k,   (8.11)

and at any level α, a rank test maximizes the derivative of the power at
θ = 0 if and only if it is of the form (8.11), where the constant k and the
probability of rejection when Σ_{j=1}^{N} c_j I_j = k are such that the test has level
exactly α.
Similar statements hold for θ < 0, with rejection when Σ_{j=1}^{N} c_j I_j is too
small.
In the case of normal shift, say F_θ is N(θ, 1) and G_θ is N(0, 1), (8.6) becomes

    h_1(x) = x,   h_2(y) = 0,

and the Z_(j) are an ordered sample from N(0, 1). The c_j given by (8.10)
with λ = 1, γ = 0 are therefore the expectations of the normal order statistics.

Thus we obtain the Fisher-Yates-Terry test as the locally most powerful
rank test, with the properties stated earlier, for normal shift alternatives.
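
The expected values of the normal order statistics that serve as scores here have no simple closed form, but they are easy to approximate by simulation. The following minimal sketch (not part of the text) assumes the scores of (8.10) take the form c_j = E_0[h_1(Z_(j))] − E_0[h_2(Z_(j))] with λ = 1 and γ = 0; the function and the normal-shift example are illustrative only.

    import numpy as np

    def lmp_scores(h1, h2, draw_null, N, reps=100000, seed=0):
        # Estimate c_j = E0[h1(Z_(j))] - E0[h2(Z_(j))], j = 1, ..., N, by simulating
        # 'reps' ordered samples of size N from the null distribution (density f0 = g0).
        rng = np.random.default_rng(seed)
        z = np.sort(draw_null(rng, (reps, N)), axis=1)
        return (h1(z) - h2(z)).mean(axis=0)

    # Normal shift: h1(x) = x, h2(y) = 0, null density standard normal; the estimated
    # scores approximate the expected normal order statistics (Fisher-Yates-Terry scores).
    c = lmp_scores(lambda z: z, lambda z: np.zeros_like(z),
                   lambda rng, shape: rng.standard_normal(shape), N=6)
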
In the logistic case, let X have the distribution (8.2) with parameters
μ = θ and σ = 1 while Y has parameters μ = 0 and σ = 1. Since the logistic
distribution with σ = 1 has f = F(1 − F), we have in (8.6)

    h_1(x) = ∂/∂θ log{F_{0,1}(x − θ)[1 − F_{0,1}(x − θ)]}|_{θ=0} = 2F_{0,1}(x) − 1   (8.12)

and h₂(y) = 0. The Z_(j) are now order statistics from the logistic distribution
F_{0,1}, so

    E_0[h_1(Z_{(j)})] = E(2U_j − 1)   (8.13)

where U_j = F_{0,1}(Z_{(j)}) is an order statistic from the uniform distribution on
(0, 1). Therefore (8.10) becomes

    c_j = λE(2U_j − 1) + γ = 2λj/(N + 1) − λ + γ.   (8.14)

A test based on these c_j is equivalent to one with c_j = j, and thus the rank
sum test is the locally most powerful rank test against logistic shift alter-
natives.
The locally most powerful rank test against Lehmann-type alternatives
is investigated in Gibbons [1964a].
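
Since E(U_j) = j/(N + 1) for uniform order statistics, the scores in (8.14) are an increasing linear function of j, which is why a test based on them orders the rank arrangements exactly as the rank sum test does. The tiny check below (not part of the text, with λ = 1 and γ = 0 assumed) verifies this numerically.

    import numpy as np

    N = 8
    j = np.arange(1, N + 1)
    c = 2 * j / (N + 1) - 1          # scores from (8.14) with lambda = 1, gamma = 0

    # Affine in j with positive slope, so the induced ordering of rank arrangements
    # is the same as that of the rank sum statistic.
    a, b = np.polyfit(j, c, 1)
    assert np.allclose(c, a * j + b) and a > 0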

8.2 The Class of Locally Most Powerful Rank Tests

In the preceding subsection we showed that for any particular one-parameter
family of alternatives, the sum of scores test with scores given by (8.10)
is locally most powerful among rank tests. We will now investigate the
family of all locally most powerful rank tests that we could obtain by varying
the alternative. As a preliminary, we note that, although the exact properties
stated after (8.10) and (8.11) imply that all locally most powerful rank tests
must be of the form (8.9) or (8.11), this form is unfortunately more inclusive
than expected or intended. The difficulty is that (8.9), though restrictive,
applies only at certain levels, while (8.11) as it stands leaves so much open that
it is completely unrestrictive. Every rank test can be expressed in the form
(8.11) by taking the c_j equal for all j. Constant c_j could be excluded, but there
are also similar, less degenerate ways of using the form (8.11) to express
many tests that we would not want to call sum-of-scores tests. Thus we may
say that all locally most powerful rank tests are sum-of-scores tests, but we
lack a good definition of the latter for purposes of this statement.
We shall not pursue this definitional problem, but rather turn to the
more interesting converse question-does every sum-of-scores test arise in
this way for some one-parameter family of alternatives? We prove below that

the answer is Yes. Specifically, given any arbitrary scores c_j, there is a one-
parameter family of alternatives given by densities f_θ, g_θ such that the c_j
satisfy (8.10) for some λ > 0 and some γ. Given any c_j, therefore, there
exists a one-parameter alternative such that, among tests at the same level,
any test of the form (8.9) uniquely maximizes the derivative of the power at
θ = 0 and uniquely maximizes the power against θ for all θ in some interval
0 < θ < ε, and any test of the form (8.11) maximizes the derivative of the
power at θ = 0.
Roughly speaking then, the class of locally most powerful rank tests is
identical with the class of sum-of-scores tests. Intuitively, it may seem un-
reasonable that the Cj should be utterly arbitrary. The intuition not reflected
in the theoretical result is that, while any particular set of scores Cj is locally
most powerful against some alternative, this may not be at all like the kind of
alternative against which good power is desired. For example, if good
power is desired against alternatives which are one-tailed in the direction of
X larger than Yand natural in other respects, then (presumably) increasing
one of the X's should not decrease the test statistic. This implies that the
Cj should be monotonically increasing in j. In general, if the class of alter-
natives is sufficiently restricted, the locally most powerful rank tests will
not yield all sets of scores. The sets of scores which arise from restricted
classes of alternatives are complicated, however, and will not be discussed
here. We note, though, that stochastic dominance alone does not imply
monotonic Cj (Problem 103). The reader is referred to Uzawa [1960] for a
complete presentation of the conditions on the Cj which result when certain
restrictions are placed on the family of alternatives.

PROOF. We will show that every set of scores c₁, ..., c_N satisfies (8.10) for
some positive λ, some γ, and some one-parameter family of the following
kind. Let the Y distribution be uniform on the interval (0, 1) and let

    f_θ(x) = 1 + θh(x),   0 ≤ x ≤ 1   (8.15)

where h is a bounded function on the interval (0, 1) with

    ∫_0^1 h(x) dx = 0.   (8.16)

Then f_θ is a density for sufficiently small θ, and (8.10) becomes

    c_j = λE[h(U_j)] + γ,   j = 1, ..., N,   (8.17)

where U₁ < ⋯ < U_N are the order statistics in a sample of N from the
uniform distribution on (0, 1). Given a set of scores c_j, our task is to find a
γ, a positive λ, and a bounded function h satisfying (8.16) and (8.17). It
suffices to find a bounded function q(u) for 0 ≤ u ≤ 1 such that

    c_j = E[q(U_j)],   j = 1, ..., N,   (8.18)
as then h(u) = q(u) + b will be bounded, will satisfy (8.16) for some b, and
will satisfy (8.17) with λ = 1, γ = −b. We will use a polynomial as the
bounded function q(u). Now it is true (Problem 94d) that

    E(U_j^k) = j(j + 1) ⋯ (j + k − 1) / [(N + 1) ⋯ (N + k)],   j = 1, ..., N.

Therefore, for q(u) = Σ_k a_k (N + 1) ⋯ (N + k) u^k, Equation (8.18) becomes

    c_j = Σ_k a_k j(j + 1) ⋯ (j + k − 1),   j = 1, ..., N,   (8.19)

and it remains to find a_k which satisfy (8.19). The right-hand side of (8.19)
can be considered a polynomial in j. There is certainly a polynomial Σ_k b_k j^k
such that

    c_j = Σ_k b_k j^k,   j = 1, ..., N.   (8.20)

Therefore (8.19) will be satisfied if the two polynomials are identical, that is,
if a_k can be chosen so that, as polynomials in j,

    Σ_k a_k j(j + 1) ⋯ (j + k − 1) = Σ_k b_k j^k.   (8.21)

It is easy to prove by induction on k (Problem 104) that j^k is a linear combina-
tion of the terms j(j + 1) ⋯ (j + l − 1) with l ≤ k. It follows that there are
a_k satisfying (8.21) and the proof is complete. □
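
The construction in the proof is concrete enough to compute. The sketch below (not from the text) solves the linear system (8.19) for a₁, ..., a_N from an arbitrary, hypothetical set of scores; the corresponding q(u) = Σ_k a_k (N + 1)⋯(N + k) u^k and density f_θ(x) = 1 + θ[q(x) + b] then follow as in the proof.

    import numpy as np

    def rising_factorial_coeffs(c):
        # Solve (8.19): c_j = sum_k a_k * j(j+1)...(j+k-1), j = 1..N, k = 1..N.
        N = len(c)
        M = np.empty((N, N))
        for j in range(1, N + 1):
            for k in range(1, N + 1):
                M[j - 1, k - 1] = np.prod(np.arange(j, j + k))   # j(j+1)...(j+k-1)
        return np.linalg.solve(M, np.asarray(c, dtype=float))

    c = [0.3, -1.0, 2.5, 2.5, 7.0]          # an arbitrary (hypothetical) set of scores
    a = rising_factorial_coeffs(c)
    # Check that the representation reproduces the scores.
    N = len(c)
    recon = [sum(a[k - 1] * np.prod(np.arange(j, j + k)) for k in range(1, N + 1))
             for j in range(1, N + 1)]
    assert np.allclose(recon, c)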

PROBLEMS

1. Let F and G be any two c.d.f.'s that satisfy the shift assumption F(x) = G(x + μ).
   Show that they have the same central moments for all k, that is, E_F{[X − E_F(X)]^k} =
   E_G{[X − E_G(X)]^k} for all k; in particular, their variances are equal.
2. Let X and Y be any two random variables with c.d.f.'s F and G respectively that
   satisfy the shift assumption F(x) = G(x + μ). Show that μ is the difference in
   their locations, no matter how location is measured; that is, μ is the difference
   between means, modes, medians, p-points, etc., whenever these quantities exist
   and are unique. What happens if the p-points are not unique?
3. For the Mosteller and Wallace data given in Section 3.1 find the one-tailed P-value
   according to
   (a) A two-sample sign test with fixed ξ = 17.0.
   (b) A two-sample quantile test with t = 13.
   (c) All two-sample sign tests with fixed ξ for 14 ≤ ξ ≤ 20.
   (d) All two-sample quantile tests for 6 ≤ t ≤ 16.
4. Given a sample of m X's and n Y's with no ties, construct a path of N steps in the
   plane, starting at the origin, such that the kth step is one unit to the right if the kth
   smallest observation is an X and one unit upward if it is a Y.
   (a) Show that there is a one-to-one correspondence between the possible paths
       and the possible rank arrangements of the X's and Y's.
   (b) Describe the acceptance and rejection regions of a two-sample quantile
       test in terms of these paths.

*5. (Early decision). In a situation where the observations are obtained in order, as
    in life testing, the outcome of a test or other procedure may be known early so
    that sampling could be curtailed before making all observations. This is especially
    true for some tests based on ranks and some confidence procedures based on order
    statistics. Answer the following questions about early decision rules, assuming
    there is no chance of ties. (Hint: Problem 4 may be helpful here.)
    (a) How early can a decision be reached for (i) a one-tailed (two-tailed), two-
        sample quantile test, (ii) a one-tailed (two-tailed), two-sample sign test with
        fixed ξ, (iii) a lower (upper) confidence bound corresponding to a two-sample
        quantile test, (iv) a two-sided confidence interval corresponding to a two-
        sample quantile test.
    (b) Show that a one-tailed quantile test based on A for preselected t that rejects
        for A ≤ a (in the notation of Table 3.1) reaches a decision of reject at the
        wth observation if and only if t − a ≤ w ≤ t and the wth observation is
        Y_(t−a).
    (c) Show that the probability of rejection in (b) under the null hypothesis is
        \binom{w-1}{t-a-1}\binom{N-w}{n-t+a} / \binom{N}{m}. (This result can be obtained
        conditionally, or equivalently, by considering a 3 × 2 table with row totals
        w − 1, 1, and N − w. Other probabilistic arguments lead to alternative formulas
        such as [(t − a)/w] \binom{w}{t-a}\binom{N-w}{n-t+a} / \binom{N}{m}.)
    (d) Show that the test in (b) reaches a decision of "accept" at the wth observation
        if and only if a + 1 ≤ w ≤ t and the wth observation is X_(a+1), and that this
        has probability \binom{w-1}{a}\binom{N-w}{m-a-1} / \binom{N}{m} under the null hypothesis.
    (e) Find the null frequency function of the number of observations required for a
        decision by a one-tailed quantile test.
    (f) Show that a two-tailed quantile test that rejects for A ≤ a or A ≥ a′ reaches a
        decision of "accept" at the wth observation if and only if t − a′ + a + 1 < w ≤ t
        and the wth observation is either Y_(t−a′+1) or X_(a+1).
    (g) Show that, under the null hypothesis, the probability that a two-tailed quantile
        test requires w observations for a decision is

        P = [\binom{w-1}{t-a-1}\binom{N-w}{n-t+a} + \binom{w-1}{a'-1}\binom{N-w}{m-a'}] / \binom{N}{m}   if w ≤ t − a′ + a + 1,

        and, with the same formula for P, is

        P + [\binom{w-1}{t-a'}\binom{N-w}{n-t+a'-1} + \binom{w-1}{a}\binom{N-w}{m-a-1}] / \binom{N}{m}   if t − a′ + a + 2 ≤ w ≤ t.

    (h) What can be said about the time required for a decision in parts (d) and (g)?
    (i) How could P-values be defined when a curtailed sampling procedure is used?
*6. Given two mutually independent random samples of m X's and n Y's from popula-
    tions with continuous c.d.f.'s F and G respectively, let U be the number of Y's
    that are smaller than the median of the X sample. If the X sample is regarded as the
    control group and the Y sample as the treatment group, the control median test
    proposed by Kimball et al. [1957] (see also Gastwirth [1968]) is to reject the
    null hypothesis F = G if U is small. Generalizing to an arbitrary quantile, let U be
    the number of Y's which are smaller than the kth smallest of the m X observations.
    (Hint: Problem 4 may be helpful.)
    (a) Show that the null frequency function of U is

        P(U = u) = \binom{k+u-1}{u}\binom{N-k-u}{n-u} / \binom{N}{n}   for u = 0, 1, ..., n.

        (Compare Problem 5c and d. U = u if and only if the (k + u)th smallest
        observation is X_(k).)
    (b) Show (preferably without algebra or reference to the statement in (a)) that
        if U = u is observed, then the P-value of a test rejecting the null hypothesis
        when U is small is a hypergeometric tail probability associated with the
        following 2 × 2 table. (This relates the "hypergeometric waiting-time"
        distribution to the ordinary hypergeometric distribution.)

                     X's      Y's
        Below         k        u        k + u
        Above       m − k    n − u      N − k − u
                      m        n        N

    (c) Show that a one-tailed test based on U, which might be called a control
        quantile test, always reaches the same decision as a suitably chosen, one-tailed,
        two-sample quantile test, and vice versa, but this is not true for two-tailed
        tests. (The lower tail is of primary interest.)
    (d) What is the confidence bound corresponding to a one-tailed test based on U?
    (e) In the situation of Problem 5, show that a decision of reject cannot be reached
        early by a lower-tailed control quantile test, and a decision of "accept"
        cannot be reached early by an upper-tailed control quantile test. When can
        the other decisions be reached?
    (f) Show that, if sampling is curtailed as soon as a decision can be reached, the
        one-tailed control quantile and ordinary quantile tests coincide in all respects
        (stopping time and decision).
*7. In the situation described in Problem 6, let V be the number of X's that are smaller
    than the median of the Y sample. The first-median test, proposed by Gastwirth
    [1968], is based on U if the median of the X sample is smaller than the median of
    the Y sample, and on V otherwise. Hence it permits an earlier decision than the
    control median test in some circumstances, especially in two-tailed tests. Gen-
    eralizing to an arbitrary quantile, let X_(k) be the kth smallest among the m X's and
    Y_(l) the lth smallest among the n Y's and let U be the number of Y's smaller than
    X_(k) and V the number of X's smaller than Y_(l). Note that U has the distribution
    given in Problem 6a. (Hint: Problem 4 may also be helpful in answering the
    questions below.)
    (a) Show that V has the same null distribution as U but with different parameters.
        What are the parameters?
    (b) Let the test statistic be U if X_(k) < Y_(l) and V otherwise. Show that U is the
        test statistic if and only if U ≤ l − 1 and V is the test statistic if and only if
        V ≤ k − 1.
    (c) Let the critical value be u if X_(k) < Y_(l) and v otherwise, where u ≤ l − 1 and
        v ≤ k − 1; express the level of this test as a sum of two hypergeometric
        probabilities. Such a test might be called a first-quantile test. (Gastwirth
        [1968] considers the case u = v, m = 2k − 1, n = 2l − 1.)
    (d) Show that each tail of a first-quantile test always reaches the same decision
        as a suitably chosen, one-tailed, two-sample quantile test, and vice versa, but
        the two-tailed, two-sample quantile tests reach the same decision as only
        those first-quantile tests having k + u = l + v. (They reach "accept" decisions
        at the same time, but the first-quantile tests reach reject decisions sooner.
        See, however, parts (j) and (k) of this problem.)
    (e) Find a convenient expression for the null conditional distribution of U given
        that U is the test statistic.
    (f) If k = l, show that the test statistic is min(U, V).
    (g) Show that if m = 2k − 1 and n = 2l − 1 the test statistic is U with prob-
        ability 1/2 under the null distribution.
    (h) Show that if k = l and m = n then U and V are identically distributed and
        the test statistic is U with probability 1/2 under the null distribution.
    (i) In (h), the two tails of the test are alike, so a two-tailed P-value can be defined
        naturally as twice the one-tailed P-value. Discuss the problems of defining
        the P-value of first-quantile tests in other situations.
    (j) In the situation of Problem 5, show that a decision of reject cannot be reached
        early by a first-quantile test. When can a decision of "accept" be reached?
    (k) If sampling is curtailed as soon as a decision can be reached, show that the
        two-tailed, two-sample quantile tests coincide in all respects with the first-
        quantile test having k + u = l + v.
    (l) Show that, under the null hypothesis, the probability that a first-quantile
        test requires w observations for a decision is

        P = [\binom{w-1}{l-1}\binom{N-w}{n-l} + \binom{w-1}{k-1}\binom{N-w}{m-k}] / \binom{N}{m}   if w ≤ u + v + 1,

        and, with the same formula for P, is

        P + [\binom{w-1}{u}\binom{N-w}{n-u-1} + \binom{w-1}{v}\binom{N-w}{m-v-1}] / \binom{N}{m}   if u + v + 2 ≤ w < t.
8. Prove that, m each of the three SituatIOns descrIbed 111 Sect. 3.2, the condItIOnal
distribution of A gIven t, or the distribution of A for fixed t, is the hypergeometric
distribution given in (3.1),
(a) If the N observations are drawn from the same population.
(b) If N given units are assigned to the two columns by randomization.

9. GIven two samples, suppose that ties occur at the median of the combined sample
and all those values at the median are counted as .. above."
(a) Under what circumstances will the margin t used in the two-sample median
test be unchanged?
(b) Show that ties reduce t otherwise.
(c) Show that, if the values at the median are omitted, the hypergeometric dis-
tribution still applies to the resulting 2 x 2 table under the null hypothesis.

10. Prove that the median test accepts (rejects) the hypothesis μ = μ₀ under the shift
    assumption if the value μ₀ is interior (exterior) to the random interval (3.6).
11. (a) Verify the following results given by Mood [1950, p. 396]. Under the shift
        assumption, when s′ ≥ s and r ≥ r′,

        P(Y_(r′) − X_(s′) < μ < Y_(r) − X_(s))
          = P(X_(s) < Y_(r) − μ and X_(s′) > Y_(r′) − μ)
          = 1 − P(X_(s) > Y_(r) − μ) − P(X_(s′) < Y_(r′) − μ)
where

,-1 (" +i - I)(N -" - i)j(N)


P(X(s) > 1(r) - II) = I~O " _ I N - " III

P(X(s') < }(r') - Jt) = Lm ('" +.' _


i -I I)(N -,,'
_ .'- i)j(N) '
1=" I II I III

(b) Relate these "hypergeometric waiting-time" tail probabilities to ordinary


hypergeometric tail probabilities,
*12. To test the null hypothesis that two samples come from the same population,
     Rosenbaum [1954] suggests a test based on the number of observations in the X
     sample which lie outside an extreme value of the Y sample. Define the test statistic
     S as the number of X's larger than the largest Y.
     (a) Show that the null frequency function of S is f(s) = n\binom{m}{s} / [N\binom{N-1}{s}]. Against
         alternatives in the direction X > Y, a test based on S would reject in the upper
         tail. Rosenbaum gives tables of upper critical values of S for tests at con-
         servative levels 0.01 and 0.05.
     (b) For m and n both large and approximately equal, show that f(s) is approxi-
         mately equal to 2^{−(s+1)}, so that the upper-tail critical values of S are 5 and 7
         for approximate α = 0.05 and 0.01, respectively. (Part (e) strengthens this
         result.)
     (c) What two-sample quantile test is equivalent to rejecting for S ≥ s?
     (d) Express the P-value of Rosenbaum's test as a hypergeometric tail probability.
         (Note that the P-values for different s correspond to different two-sample
         quantile tests.)
     (e) Show that f(s) ≤ n(m/N)^s/(N − s) and P(S ≥ s) ≤ N(m/N)^s/(N − s) under
         the null hypothesis.
*13. Use differences of X and Y order statistics to express the confidence limits for a
shift parameter that correspond to
(a) The control median test (see Problem 6).
(b) The first-median test (see Problem 7).
(c) The control quantile test (see Problem 6),
(d) The first-quantile test (see Problem 7).
(e) Rosenbaum's test (see Problem 12),
14. (a) For the one-tailed, two-sample quantile test agamst an alternatIve of two
different popUlation densities, express the power as a two-variable integral
involving the two population cumulative and density functions.
(b) Show that the power of the two-tailed test is the sum of two such integrals.

15. With the notation of Table 3.1 for the two-sample quantile test, show that a/m
and (t - a)/n are bounded away from 0 and 1 under suitable conditions on Ill, n, t,
and (x.
*16. (a) Argue that if X has density f and c.d.f. F, then the order statistic X_(i) is approxi-
         mately normal with mean F⁻¹(i/m) and variance i(m − i)/(m³{f[F⁻¹(i/m)]}²)
         under suitable conditions on i, m, and F. (One such argument uses the normal
         approximation to the binomial distribution of the empirical cumulative
         distribution.)
     (b) What kind of precise limit statements along these lines would you expect to
         hold, under how broad conditions?
     (c) Sketch a proof, perhaps under more restrictive conditions.
17. Show that (3.11) is a necessary and sufficient condition for two random variables
X and Y to have different medians, whatever median is chosen for each if the
median is not unique.
18. Show that the median test is consistent against alternatives with different popula-
    tion medians
    (a) Using the idea of consistent estimation and its relation to consistent tests
        (see Sect. 3.4, Problems 26 and 28 of Chap. 3 and Problem 1 of Chap. 1).
    (b) Using the consistency of the sign test with fixed ξ as in (3.11).
*19. (a) Show that the two-sample sign test with fixed ξ has the optimum properties
         stated in Sect. 3.7 when the X's come from one population and the Y's from
         another.
     (b) Show that these properties also hold for suitable alternatives under which
         neither the X's nor the Y's need be identically distributed.
*20. Show that the hypergeometric probability of a or less in Table 3.1 is less than the
binomial probability of a or less for the binomial parameters m and p = t/N if
a ::::; (mt/ N) - 1, and greater than if a ~ mt/N [Johnson and Kotz, 1969; Uhlmann,
1966].
*21. (a) Argue or show by example that in a treatment-control experiment, if the
treatment effect is not an identical shift in every unit, the level of the confidence
procedure corresponding to the median test may be less than nominal despite
random assignment of units to groups, even when both population distribu-
tions are symmetric and the treatment effect is defined as the difference
between the two centers of symmetry. (One type of example has one group
much less dispersed than the other and uses Problem 20).
(b) In (a), show that if the populations have the same shape but different scales,
then asymptotically the confidence level is always less than nominal. Assume
the population has nonzero density at the median. (Hint: Use Problem 16.)
*22. Show that the lower-tailed median test is bIased against the alternative that the
median of the X population is larger than the median of the Y population. (An
example of power less than (X can be constructed as III Problem 21 a if the difference
III medians is very small.)

23. Use Theorem 3.1 to show that a suitable median test rejects with probability at
most (X under (3.12) and at least (X under (3.13). Show that, more generally, the
test functions of one-tailed, two-sample sign tests rejecting for A large are increasing
in the Y direction.

24. Prove Theorem 3.1, that a test is monotonic In the distributions of the observations
if the critical function is monotonic in the observations.
25. (a) Use (3.8) to (3.10) to give an easily evaluated expression for the approximate
power of the one-tailed median test against the alternative that the X and Y
populations are both normal but with possibly different means and variances.
(b) Evaluate this power expression for m = 6, n = 9, ex = 0.10, population means
differing by 1, and population variances in the ratios 0.25, 1, and 4.
26. A sample of size t is drawn without replacement from a finite dichotomous popula-
    tion of size N which contains exactly m elements with the value 1 and n = N − m
    with the value 0. For the sample observations X₁, ..., X_t, the number of 1's in the
    sample is Σ₁ᵗ Xᵢ = tX̄.
    (a) Show that

        E(tX̄) = mt/N,   var(tX̄) = mnt(N − t)/[N²(N − 1)].

    (b) What is the probability distribution of tX̄?
    (c) Relate this situation to the two-sample quantile test statistic of Sect. 3.
    (d) Relate this situation to the corresponding situation with a sample of size m
        and a population containing t 1's and N − t 0's.

27. Let ~ be the median of a combined sample of III X and n Y random vanables. Let
F",(e) and Gn(e) denote the respective sample proportions of X's and Y's whIch
e
are smaller than so that the median test statistic can be written as A = mFm(e).
(a) Show that the one-tailed median test that rejects for small values of A is
equivalent to a test rejecting for small values of F m(e) - G.(e).
e
(b) If n -> 00 while m remains fixed, show that converges to the median of the Y
population, find the limiting value of the two-sample median test statistic,
and show that the median test approaches the one-sample sign test (Moses
[1964]).

*28. What happens to the two-sample quantile test if N -> 00 with t fixed?

29. Verify the linear relationships between the Mann-Whitney and Wilcoxon statistics
given in (4.3) and (4.4).
30. (a) If r_{m,n}(u) denotes the number of arrangements of m X and n Y variables such
        that the value of U, the number of (X, Y) pairs for which X > Y, equals u,
        show that

        for all u = 0, 1, ..., mn and all positive integer-valued m and n, with the
        following initial and boundary conditions for all m ≥ 0 and n ≥ 0:

        r_{m,n}(u) = 0 for all u < 0 and all u > mn
        r_{m,0}(0) = r_{0,n}(0) = 1.

        This provides a simple recursive relation for generating the frequencies of
        values of U and hence the null probability function p_{m,n}(u) = P(U = u)
        for sample sizes m and n, using p_{m,n}(u) = r_{m,n}(u)/\binom{m+n}{m}.
    (b) What change is required in order to generate directly the null cumulative
        probabilities F_{m,n}(u) = P(U ≤ u)?
    (c) What change is required in order to generate the null probability function of
        the X rank sum R_X? The null cumulative function of R_X?
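
The recursion of Problem 30 is easy to implement. The sketch below (not from the text) uses one standard recurrence of this kind, obtained by conditioning on whether the largest of the m + n observations is an X or a Y, namely r_{m,n}(u) = r_{m−1,n}(u − n) + r_{m,n−1}(u); whether this is exactly the form intended in the problem is an assumption, but it satisfies the stated boundary conditions.

    from functools import lru_cache
    from math import comb

    @lru_cache(maxsize=None)
    def r(m, n, u):
        # Number of arrangements of m X's and n Y's with U = u, where U counts (X, Y)
        # pairs with X > Y.  Condition on whether the largest value is an X (it then
        # exceeds all n Y's) or a Y (it exceeds none of the X's).
        if u < 0 or u > m * n:
            return 0
        if m == 0 or n == 0:
            return 1 if u == 0 else 0
        return r(m - 1, n, u - n) + r(m, n - 1, u)

    def null_pmf(m, n):
        total = comb(m + n, m)
        return {u: r(m, n, u) / total for u in range(m * n + 1)}

    # Example: P(U = u) for m = n = 3; the distribution is symmetric about mn/2 = 4.5.
    pmf = null_pmf(3, 3)
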
31. Use the recursive method developed in Problem 30 to generate the complete
null distribution of U or Rx for all m + n :$ 6. Check your results using Table F.

*32. Derive results for the two-sample case that are analogs of the one-sample results
given in Problem 13 of Chap. 3. (Fix and Hodges [1955] give these two-sample
results and a table. Their work is 1I1spired Problem 13 of Chap. 3.)
33. Let R be uniformly distributed on the integers 1, 2, ..., N and let S be independently
    uniformly distributed on (0, 1).
    (a) Show that S has mean 1/2 and variance 1/12.
    (b) Show that R + S is uniformly distributed on (1, N + 1) and hence has the
        same distribution as NS + 1.
    (c) Use (a) and (b) to show that R + S has mean N/2 + 1 and variance N²/12.
    (d) Use (c) and (a) to show that R has mean (N + 1)/2 and variance (N² − 1)/12.
34. Use the null mean and variance of Rx given in (4.7) and (4.8) to verify the cor-
responding moments of Ry and U given in (4.9) and (4.\0).

35. Show that the null distributions of U and U' are identical.

36. Show that the possible values of U, U', R x , and Ry are as stated in the paragraph
follOWIng (4.10).
37. Show that U and U' are symmetrically distributed about mn/2, Rx is symmetric
about m(N + 1)/2, and Ry is symmetric about n(N + 1)/2.
38. (a) Show that the continuity correction for the approximate normal deviate of
        the rank sum test statistic is √(3/[mn(N + 1)]).
    (b) Show that for 0.1 < m/N < 0.9, this continuity correction is less than
        1/√(0.03N³) and hence less than 0.02 if N ≥ 44, less than 0.01 if N ≥ 70.
    (c) Show that for 0.1 < m/N < 0.9, the continuity correction in the normal
        approximation (3.2) for the median test is less than 1/(0.3√N) and
        hence less than 0.02 if N ≥ 27778, less than 0.01 if N ≥ 111112.
39. Show that the value of k given in (4.11) and needed for the confidence bound is
one less than the rank (among all pOSSIble RJ of the lower taIl critIcal value at
level (J. and hence k + I is the rank.
40. Represent two samples of sizes m and II by a path as explained in Problem 4. This
path separates the rectangle with corners (0, 0), (0, II), (m, 0), and (/11, n) into two
parts. Show that the upper left part has area U and the lower right part has area U',
where U and U' are the Mann-Whitney statistics.
41. Given continuous c.d.f.'s F and G, show that p₁, p₂, and p₃ as given in Equations
    (4.12)-(4.14) can be written as follows, where dF(x) can be replaced by f(x) dx
    if F has density f, and similarly for dG(y).
    (a) p₁ = ∫ G(x) dF(x) = ∫ [1 − F(y)] dG(y).
    (b) p₂ = ∫ [1 − F(y)]² dG(y) = 2 ∫ [1 − F(x)]G(x) dF(x).
    (c) p₃ = ∫ G²(x) dF(x) = 1 − 2 ∫ F(y)G(y) dG(y).

42. Calculate PI' P2, and P3 defined in Equations (4.12)-(4.14) if X and Yare drawn
from identical populations, and substitute these results in (4.15) and (4.16) to
verify the results given in (4.10).

43. Derive the expression for var(U) in Equation (4.22) by using (4.18)-(4.21).

44. Natural estimators of p₁, p₂, and p₃, as defined in Equations (4.12)-(4.14), are the
    corresponding proportions of sample comparisons (Xᵢ, Yⱼ) which satisfy the
    respective inequalities.
    (a) Show that these estimators for p₁ and p₂ can be expressed as

        p̂₁ = Σᵢ (Rᵢ − i)/mn = U/mn

        p̂₂ = 2 Σᵢ (Rᵢ − i)(m − i)/[mn(m − 1)],

        where Rᵢ is the rank of X_(i) in the combined sample.
    (b) Give a similar expression for p̂₃.
    (c) Find the expected value of these estimators under the null hypothesis of identi-
        cal distributions.
45. Let X₁, ..., X_m and Y₁, ..., Y_n be independent random samples from the con-
    tinuous distributions F and G respectively.
    (a) Show that the Mann-Whitney statistic U can be written as n Σᵢ₌₁^m G_n(Xᵢ),
        where G_n(y) is the proportion of Y's that are smaller than y.
    (b) If n → ∞ while m remains fixed, U/n → Σᵢ₌₁^m G(Xᵢ) = W, say (in probability,
        and with probability one). Characterize the distribution of W under the null
        hypothesis.
    (c) What is the limit of the rank sum test as n → ∞ while m remains fixed? It
        amounts to a one-sample test (see Moses [1964]). Is this one-sample test a
        rank test? A nonparametric test?
    (d) Show that E(W/m) = P(X > Y) = p₁ as defined in (4.12) (see Moses [1964]).
46. Show that the rank sum test is consistent against the alternatives given in Sect. 4.5.
47. Show that, for some alternatives, the median test is consistent and the rank sum
test is not, and vice versa.
48. Suppose that X and Yare independent but not identically distributed, and that
X is stochastically larger than Y.
(a) Show that P(X > Y) > t.
(b) Show that E(X) > E(Y) if both expectations exist and are not both infinite.
(c) Show that the median of X is at least as large as the median of Y, provided at
least one is unique, no matter how the other is chosen.
(d) What happens in (c) if neither X nor Y has a unique median? (Of course,
(c) and (d) apply equally to quantiles of any order.)
49. Show that if F(x) = [G(x)r for all x (Lehmann alternative), then X is stochastically
larger than Y for k > 1, smaller for k < 1.

50. Show that a nonzero shift implies stochastic dominance.

51. Show that the rank sum test is consistent against stochastic dominance, and hence
by Problem 50, is consistent against shifts.

52. Suppose the X and Y populations have the same median but, with high probability,
    X is close to the median and Y is not. Show that
    (a) In the notation of Sect. 4.4, p₁ = 1/2, p₂ = 1/2, and p₃ = 1/4, approximately.
    (b) E(U) = mn/2 and var(U) = m²n/4, approximately.
    (c) As approximated, the mean is the same as under the null hypothesis, but the
        variance is larger for n > 2m − 1, smaller for n < 2m − 1.
    (d) The probability of rejection by the rank sum test must be greater than α if
        n > 2m − 1, less if n < 2m − 1, for some levels α and some populations of
        the type described.

*53. (a) What questions about the differences ~ - X, correspond to questIOns about
the Walsh averages in Problem 37 of Chap. 3? .
(b) Investigate some of the questions raised In (a).
54. Prove that the rank sum test is unbiased against the alternative that the distribu-
tIOn of every X, is stochastically smaller than the distribution of every lj.

55. Use Theorem 3.1 to show that the rank sum test rejects with probability at most IJ.
under (3.12) and at least IJ. under (3.13).
56. For a sample of X's and a sample of Y's, possibly with ties, let L be the (k + 1)th
    smallest difference Yⱼ − Xᵢ. Show that if the Y population differs from the X
    population only by a shift μ, then P(L ≤ μ) ≥ 1 − α ≥ P(L < μ) where 1 − α is
    the exact probability in the continuous case. (Hint: Consider breaking ties ran-
    domly. See also Problem 107.)
57. Consider the confidence interval for shift corresponding to the rank sum test.
(a) Show that if 0 is not an endpoint, then all methods of breaking any ties lead
to the same decision, namely, to "accept" if 0 is outside and reject if 0 is
inside.
(b) Show that if 0 is an endpoint, then there are tied observations and breaking
ties one way leads to rejection, another way to acceptance.
58. Show that the value of Rx computed using the average rank procedure is equal to
U + m(m + 1)/2, where U is equal to the number of (X" lj) pairs for which
X, > lj plus one-half of the number for which X, = lj.

59. Show that when ties are handled by the average rank procedure, under H₀,
    (a) The means of R_X and U are not affected by ties but the variance is reduced to

        var(R_X) = var(U) = mn[N(N² − 1) − Σ t(t² − 1)]/[12N(N − 1)]

        where t is the number of observations tied at a given rank and the sum is over
        all sets of tied observations (as in Problem 52 of Chap. 3).
    (b) The distributions of R_X and U may not be symmetrical.
60. Show that conditions (i)(b) and (Ii) in Section 4.7 are equivalent.

61. Show that the average rank test procedure satisfies (i)(b) and (ii) but not (i)(a)
and (iii) in Section 4.7.

62. Show that breaking ties randomly


(a) Preserves the null distribution of U, even conditionally on the pattern of ties
present.
(b) Satisfies all the conditions (i)-(iii) in Section 4.7.

*63. Show that the confidence regions for shift corresponding to the average rank
procedure are always intervals.
64. Show that the multiplicities of the average ranks and hence the exact pattern of
ties can be determined from the ranks without multiplicities. (For mstance, the
average ranks 1,3,6,9.5, 12.5, 14 can arise only from a combined sample with the
pattern of ties displayed in Table 42.)
65. Use the result in Problem I of Chap. 1 to show that U/mn is a consistent estimator
of PI = P(X, > ~).
66. Show that U/mn is a minimum variance unbiased estimator of PI if the only
restriction is that the X and Y distributions are continuous.
*67. Under the assumption that the X and Y populations are normal with arbitrary
means, express the minimum variance unbiased estimator of PI as conveniently
as possible for the case of variances
(a) Known.
(b) Common but arbitrary (unknown).
(c) Arbitrary, possibly not all the same.
*68. (a) Show that if the X and Y populations are normal with common variance,
         then for large m and n, the variance of Φ[(X̄ − Ȳ)/(s√2)] is approximately
         (1/2)φ²(ξ)(1/m + 1/n + ξ²/N) where ξ = (μ_X − μ_Y)/(σ√2).
     (b) Compare the result in (a) to the variance of U/mn.

69. How might one find the values of PI for which (U - I1lIlPI)2/Z2 equals
(a) The right hand side of (4.26)?
(b) The right hand side of (4.27)?
*70. Show that the bound on var(U) given by (4.27) is never greater than that given by
(4.26). (Hint: The mequahty to be proved is equivalent to [(I - 2p)3/2 - 1 +
3p](2m - II - 1) :::; 3p2(m - I) for 0 < P < 1. You may wish to prove and use
I - 3p :::; (I - 2p)3/2 :::; I - 3p + 2p2 for 0 < P < !-)
71. Mr. Greenthumb has come to consult you about the following problem. To test
    the effect of a certain type of fertilizer on the growth of spinach, he divided his
    spinach field into 20 plots, picked ten plots at random and fertilized them, and left
    the other ten unfertilized. Upon harvesting he obtained the following yields, in
    bushels.
    Unfertilized plots: 6.1, 10.2, 8.7, 6.4, 7.3, 10.9, 7.7, 8.4, 9.0, 9.8
    Fertilized plots: 10.1, 11.2, 12.3, 9.2, 12.0, 11.9, 9.6, 10.8, 10.3, 12.7
    When you question him persistently, he admits to thinking that, in the absence
    of fertilizer, the yields should be independently distributed, approximately nor-
    mally with the same mean and variance for all plots. He thinks that the effect of the
    fertilizer is to increase the yield by some amount; he is sure it does no harm.
    The people paying for his research do not like distribution assumptions,
    however, and he has consented to analyze the data without such assumptions.
    (a) What methods of analyzing his data would you suggest he consider? What
        would you tell him about these methods? Be as precise as you can, but re-
        member that Mr. Greenthumb, though highly intelligent, is nevertheless not a
        statistician.
    (b) Mr. Greenthumb would like to make a preliminary report right away. For this
        purpose, analyze the data in some quick but reasonable way, even if it is not
        the way you consider optimum.
72. Twenty mice are placed randomly in individual cages and the cages are divided
randomly into two groups, each of size 10. All the mice are infected with tubercu-
losis; then the mice in the second group are each given a certain drug (B) while
those in the first group are given a placebo (A). Since the drug (B) is known to be
nontoxic, those mice in the treatment group would not be expected to die sooner
than the control group. The number of days to death after infection are shown
below.
Control (A): 5, 6, 7, 7, 8, 8, 8, 9, 10, 12
Drug (B): 7, 8, 8, 8, 9, 9, 12, 13, 14, 17
(a) Test the hypothesis that the drug is without effect, using the rank sum test
and the average rank procedure to handle ties.
(b) Find the smallest and largest P-values when the ties are broken.
(c) Find a lower confidence bound for the effect ofthe drug, using a level ofO.90.
73. A professor decided that since it was necessary to give tests in overcrowded
classrooms, the temptation for eyes to wander should be minimized. He decided
to give two sets of tests, with the only difference being the order in which the
questions appeared. The tests were distributed in such a way that no student
could gain information if his eyes wandered. The test results are given below.
Determine whether there is a significant difference between the average grades for
these sets, and find a confidence interval for the difference.
Set A: 78, 68, 78, 90, 66, 75, 50, 42, 80, 74
Set B: 82, 81, 83, 95, 91
74. Suppose X and Y samples each have possible values I, 2, ... , r and the value i is
observed m, times in the X sample, n, times in the Y sample, I, ml = m and I, nj =
n. The observed frequencies can be arranged in an r x 2 table with ith row entries
m" n, and row total N, = m, + nj. Consider using the rank sum test with ties
handled by the average rank procedure.
(a) Express the test statistic in terms of the observed frequencies.
(b) How does this test relate to the ordinary chi-square test for an r x 2 table
when r = 2? When r > 2?
(c) If the possible values are some arbitrary numbers aI' a2' ... , a, instead of
1,2, ... , r, what effect would this have?
*75. (Early decision in rank sum tests). Suppose X and Y observations are obtained in
order of magnitude and a one-tailed rank sum test is to be used based on sample
sizes m and n and rejecting for Rx :;; t. Assume no ties. The results below are based
on Alling [1963]. The similar problem for censored samples is discussed in Halperin
and Ware [1974].
(a) Given the first N' = m' + n' observations, derive expressions for the minimum
and maximum possible values of Rx.
(b) When can a decision to reject first be reached? An "accept" decision?
(c) Show that a decision to reject can be first reached only after observing an X,
an "accept" decision after observing a Y.
(d) Show that a decision can always be reached early and that this is true of all
rank tests.

76. Show that the confidence region for a shift parameter corresponding to a one-
tailed or two-tailed test based on a sum of scores is an interval if the Ck form a
monotone sequence.
77. (a) Show that the null distribution of the sum of the X scores in a sample of m X's
        and n Y's has mean m Σ₁^N c_k/N and variance mn[Σ₁^N c_k²/N − (Σ₁^N c_k/N)²]/
        (N − 1), in the notation of Sect. 5.
    (b) Show without further algebra that the sum of the Y scores has the same variance
        as the sum of the X scores (under all circumstances).
    (c) Use the result in (a) to verify (4.7) and (4.8) for the rank sum test.
    (d) Use the result in (a) to obtain the null mean and variance of the median test
        statistic A.

78. (a) Argue that if c_k is approximately J[(k − 0.5)/N] or J[k/(N + 1)] for some
        function J, then Σ₁^N c_k/N is approximately ∫₀¹ J(u) du and Σ₁^N c_k²/N is approxi-
        mately ∫₀¹ J²(u) du.
    (b) What kinds of conditions would be needed to make the approximations in (a)
        good?
    (c) What function J corresponds to the rank sum test?
    (d) Compare the approximate and exact values for the rank sum test.
    (e) What function J corresponds to the Fisher-Yates-Terry test? The van der
        Waerden test?
    (f) For these test statistics, what approximate values arise by applying the result
        in (a) to the mean and variance in Problem 77(a)?
*79. Consider a test based on the sum of scores Ck for Ck = k + 2- k • Formulate the
two-sample counterpart of Problem 91 of Chap. 3 and answer the questions posed.

80. Use Theorem 3.1 to show that a sum of scores test rejects with probability, at most
ex under (3.12) and at least ex under (3.13) if the scores form a monotonic sequence.
81. Do (a)-(f) below using the Fisher-Yates-Terry procedure.
    (a) Suppose the following are two independent random samples, the second
        drawn from the same distribution as the first except for translation by an
        amount θ (to the right if θ > 0). Test at a level near 0.01 the null hypothesis
        θ = 2 against the alternative θ < 2.

        First sample: -0.2, 0.4, -0.8, -1, 0
        Second sample: 0.5, 1.1, -0.3, 0.6, 2, 0, 0.8, 1.5

    (b) Give the exact level of the test used in (a).
    *(c) Give the confidence bound for θ which corresponds to the test used in (a).
    (d) Suppose the following are two independent random samples, the second
        drawn from the same distribution as the first except for translation by an
        amount θ. Test at level near 0.10 the null hypothesis θ = 0 against the alterna-
        tive θ ≠ 0.

        First sample: 79, 13, 138, 129, 59, 76, 75, 53
        Second sample: 96, 141, 133, 107, 102, 129, 110, 104

    (e) Give the exact level of the test used in (d).
    *(f) Give the confidence interval corresponding to the test used in (d).

82. Do (a)-(f) of Problem 81 using the median test and related confidence procedure.

83. Do (a)-(f) of Problem 81 using the rank sum test and related confidence procedure.

84. Suppose that m X observations are independent and follow the uniform distribu-
    tion on (0, 1), and that n Y observations are independent and have the density

        g(y) = (1 + α)y^α for 0 ≤ y ≤ 1, where α > −1.

    If the X and Y observations are mutually independent, find the probabilities of
    all \binom{m+n}{m} possible rank arrangements if
    (a) m = n = 2.
    (b) m = 2, n = 3.
    (c) m = n = 3.

85. What is the relation between the densities in Problem 84 and Lehmann alterna-
tives F(x) = [G(X)]k for all x?

86. Suppose that X I' X 2 have density f(x) = 1 for 0 ~ x ~ 1; YI , Y2 have density
g(y) = I for -e ~ y ~ 0 and e ~ y ~ 1; and all are independent. Find the
probability of each possible set of combined ranks. (Note that this probability
is not increasing in the X ranks even though F(x) ~ G(x) for all x.) Find a function
Q such that F(x) = Q[G(x)].

87. Given an arbitrary two-sample procedure, how can a permutation-invariant


procedure be defined which has identical statistical properties whenever the
two samples are drawn mdependently from two populations?

88. Suppose that X I> ... , X." YI>"" Y. are mdependent but neither the X's nor the
Y's are necessarily identically distributed. How might one argue for using a
permutation-invariant procedure? What are some circumstances under which
permutation in variance would clearly be inappropriate?

89. Given arbitrary X" ~, X;, Y~ for i = 1, ... , III, j = 1, ... , n, show that the X,
and ~ have the same ranks in the combined sample of all X,, ~ as do the X;, Yj in
the combined sample of all X;, Y~, if and only if there exists a strictly increasing
function g such that g(X j ) = X; and g(~) = Y~ for all i,j.

90. (a) Show that all strictly increasing transformations of the observations leave the
following hypothesis invariant: XI' ... , X .. and YI , ... , Y" are mdependent
and the X population is stochastically larger than the Y population.
(b) Show the same for the hypothesis that X I' ... , X .. , YI , ... , Y" are independent
(but not necessarily identically distributed).
(c) Show the same for the hypothesis that X I " " , X m , YI , •.• , Y" are independent
and the X's are identically distributed.
(d) Show that this invariance does not apply to the shift hypothesis.

91. Show that a two-sample test IS a permutation-invariant rank test if and only if its
critical function depends only on the indicator variables I J defined in Sect. 6.

92. Show that a rank test is the most powerful rank test at level (X against a simple
alternative K if and only if it has the form (8.1) and level exactly (x.

*93. Suppose the X population is N(θ, 1) and the Y population is N(0, 1). Use Problem
     97 to show that

     where the Z_(j) are order statistics from N(0, 1) and O(θ³) indicates terms of
     order θ³.
     (b) For sufficiently small positive θ, the most powerful rank test at level 3/\binom{N}{m} rejects
         if the X ranks are any permutation of (n + 1, n + 2, ..., N), (n, n + 2, n + 3, ...,
         N), or (n + 1 − k, n + k, n + 3, n + 4, ..., N), where k = 1 or 2, and it remains
         to be determined whether k = 1 or k = 2.
     (c) If m = n, both choices for k give the same coefficient of θ and hence the most
         powerful test is whichever choice gives the larger value of var(Z_(n+1−k) +
         Z_(n+k) + Z_(n+3) + ⋯ + Z_(N)). With appropriate tables of the variances and
         covariances of normal order statistics, e.g., Teichroew [1956], it can be
         verified that the two choices are not equally powerful in general.
     (d) There are many sets of scores for which the sum of scores would give the most
         powerful test, including the choice of k. The Fisher-Yates-Terry scores,
         however, would not distinguish between k = 1 and k = 2 when m = n.
     (e) What rank tests maximize the derivative of the power at θ = 0?
     (f) Reconcile (c) and (d) with the statement that the Fisher-Yates-Terry test is
         the locally most powerful rank test against normal alternatives.

94. If U₁ < U₂ < ⋯ < U_N are order statistics of a sample of N from the uniform
    distribution on (0, 1), show that
    (a) E(Uᵢ) = i/(N + 1).
    (b) cov(Uᵢ, Uⱼ) = i(N + 1 − j)/[(N + 1)²(N + 2)] for i < j.
    (c) var(Uᵢ) = i(N + 1 − i)/[(N + 1)²(N + 2)].
    (d) E(Uᵢʳ) = i(i + 1) ⋯ (i + r − 1)/[(N + 1) ⋯ (N + r)].
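
These moments are easy to check by simulation before deriving them; the snippet below (not part of the problem) is such a check for one choice of i, r, and N.

    import numpy as np

    N, i, r, reps = 6, 2, 3, 200000
    rng = np.random.default_rng(1)
    u = np.sort(rng.uniform(size=(reps, N)), axis=1)
    ui = u[:, i - 1]
    print(ui.mean(), i / (N + 1))                                  # part (a)
    print(ui.var(), i * (N + 1 - i) / ((N + 1) ** 2 * (N + 2)))    # part (c)
    print((ui ** r).mean(),
          np.prod(np.arange(i, i + r)) / np.prod(np.arange(N + 1, N + r + 1)))  # part (d)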

*95. Suppose the X population is logistic (θ, 1) and the Y population is logistic (0, 1).
     Use Problem 97 to show that

     where the U_k are order statistics from the uniform distribution on (0, 1),
     λ = e^{−θ} − 1, and O(θ³) indicates terms of order θ³.
     (b) The statement of Problem 93b holds here also.
     (c) Both choices for k give the same coefficients of θ and θ² and hence determina-
         tion of the most powerful choice requires either further terms or a different
         approach. (Hint: Use Problem 94. Note that E(U_{n+1−k} + U_{n+k}), cov(U_{n+1−k} +
         U_{n+k}, U_j), and E[(U_{n+1−k} + U_{n+k})U_j] do not depend on k for j > n + k.)
     (d) The derivative of the power at θ = 0 is maximized by either choice of k (and
         thus by a nonrandomized test).
     (e) A randomized rank sum test also maximizes the derivative of the power at
         θ = 0. What is its critical function and how does this test relate to the fore-
         going nonrandomized test?
96. In the two-sample problem, consider a one-parameter family of alternatives
    indexed by θ, where θ = 0 gives the null hypothesis. Let α(θ) be the power of an
    arbitrary rank test and α′(0) the derivative of the power at θ = 0. Show that
    (a) A rank test maximizes α′(0) among rank tests at level α if and only if it has the
        form (8.3) and level exactly α.
    (b) A rank test of the form (8.4) uniquely maximizes both α′(0) and α(θ) for all θ
        in some interval 0 < θ < ε.
97. Suppose that two populations have c.d.f.'s F and G and densities f and g, and that
    f vanishes whenever g vanishes. Show that the rank arrangement r has probability

        P(r) = E[∏ᵢ₌₁^m f(Z_{(rᵢ)})/g(Z_{(rᵢ)})] / N!

    where Z_(1) < ⋯ < Z_(N) are the order statistics of a sample of N from the distribu-
    tion with density g.
*98. Show that the locally most powerful rank test against continuous Lehmann
     alternatives F_θ = G^{1+θ} is based on the scores c_j = E(log V_j) where V_j has the
     beta distribution with parameters j and N − j + 1.
*99. (a) Show that the locally most powerful rank test against an alternative with
monotone likelihood ratio fo(x)/go<x) which is, say, nondecreasing in x for
every 0, IS based on monotonic scores cr
(b) Show by example that the result in (a) need not hold if the property of monotone
likelihood ratios is replaced by stochastic dominance (which is weaker).
100. Show that, when the observations are obtained in order, a two-tailed quantile
test that rejects for A ~ a and A ;::: a' first reaches a decision at the second smallest
of X(a+ I), X(a') , 1(.-a'+ I), 1(.-a)'
101. Suppose that two populations have c.d.f.'s F and G and densities f and g, and that f
     vanishes whenever g vanishes. Show that
     (a) H(u) can be defined so that F(x) = H[G(x)] for all x.
     (b) H′(u) = h(u) = f[G⁻¹(u)]/g[G⁻¹(u)].
     (c) P(r) = E[∏ᵢ₌₁^m h(U_{(rᵢ)})]/N! where U_(1) < ⋯ < U_(N) are uniform order
         statistics on (0, 1).
     (d) For the parametric alternative F = F_θ, G = F₀,

         ∂/∂θ P_θ(r)|_{θ=0} = (1/N!) Σᵢ₌₁^m E{∂/∂θ log f_θ[F₀⁻¹(U_{(rᵢ)})]|_{θ=0}}

         and this agrees with (8.7).

*102. Derive an expression for the scores that give the locally most powerful rank test
      for alternatives under which the X and Y distributions have densities in
      (a) An exponential family p_θ(x) = C(θ)h(x)e^{Q(θ)T(x)}.
      (b) A shift parameter family p_θ(x) = p(x − θ).
*103. (a) Show that the most powerful rank test against a simple alternative with
          densities f, g and likelihood ratio f(x)/g(x) which is nondecreasing in x rejects
          when the X ranks are r₁, ..., r_m if it rejects when they are r′₁, ..., r′_m and r′ᵢ ≤ rᵢ
          for all i. More generally, show that the critical function of this test as a function
          of the X ranks is nondecreasing in each. (Hint: Use Problem 97.)
      (b) Show by example that the most powerful rank test against a simple alternative
          with X stochastically larger than Y need not have this property.
104. Show that for any k, x^k is a linear combination of terms of the form x(x + 1) ⋯
     (x + l − 1) with l ≤ k. (For instance, x² = −x + x(x + 1).)

*105. In the one-sample problem, consider a family of densities f_θ(x) = 0.5 + 0.5θg(x),
      −1 ≤ x ≤ 1, with g bounded on [−1, 1] and ∫₋₁¹ g(x) dx = 0. Show that
      (a) The locally most powerful signed-rank test of θ = 0 against θ > 0 is based
          on the sum of signed constants c_j with c_j = E[g(U_j) − g(−U_j)] where U_j has
          the beta distribution with parameters j and n − j + 1.
      (b) If g(x) = Σ_k a_k(n + 1) ⋯ (n + k)x|x|^{k−1}, where k runs from 1 to a finite upper
          limit, then g satisfies the conditions given above and leads to c_j = Σ_k a_k j(j +
          1) ⋯ (j + k − 1).
      (c) Any set of constants c_j arises in this way from some family f_θ of the type specified
          and hence is locally most powerful against some alternative of this type.
106. Consider a crossover design in which III units receive the active treatment before
the control and II units receive the control before the treatment. If the 1/1 + II
response dIfferences (later - earlier) are taken as the basic data, they may be
analyzed as 111 a two-sample problem.
(a) What conditions here imply the usual "nonparametric" null hypotheses for
the two-sample problem? Are they "nonparametric?"
(b) Under what conditions would an estimate of the population difference 111 the
two-sample problem be an estimate of the treatment effect?
(c) Show that, under suitable condItIOns, a test of the null hypothesis of no treat-
ment effect can be made conditionally on the number of units with a response
change, using a 2 x 2 table with one row for increases and one for decreases.
*107. Do Problem 109 of Chap. 3 with two independent samples in place of one sample
and with the shift assumption in place of the symmetry assumption.
CHAPTER 6
Two-Sample Inferences Based
on the Method of Randomization

1 Introduction
In Chap. 4, the principle, method, and rationale of randomization tests
applicable to the one-sample (or paired-sample) case and the null hypothesis
of symmetry about a specified point were illustrated and some properties
of these tests were discussed. In this chapter we employ the same ideas in
the case of two independent sets of observations and the null hypothesis of
identical populations.
As in Chap. 5, we may have either two random samples independently
drawn from two populations or N = m + n units, m of which are chosen at
random for some treatment. Under the null hypothesis that the two popula-
tions are identical, or that the treatment has no effect, given the N observa-
tions, all \binom{N}{m} distinguishable separations of them into two groups, m labeled
X and n labeled Y, are equally likely. This fact was used to generate the null
distributions of all the test statistics considered in the last chapter. The
distribution of a two-sample statistic generated by taking all possible
separations of the observations into two groups, X and Y, as equally likely
is called its randomization distribution. Hence this distribution applies under
the null hypotheses mentioned above. A two-sample randomization test then
is one whose level or P-value is determined by this randomization distribu-
tion.
The two-sample tests discussed in Chap. 5 are called rank tests, since the
test statistic employed in each case is a function of the ranks of the X's and
Y's in the combined group. However, they are also members of the class of
two-sample randomization tests and hence might be called rank-randomiza-
tion tests. On the other hand, randomization tests may use a test statistic


which is not a function of the ranks alone; rather, the statistic may be any
function of the actual values of the observations. To emphasize this, these
randomization tests might be called observation-randomization tests. In
general, the randomization distribution of an observation-randomization
test statistic must be generated anew for each different set of observations.
Since the randomization distribution applies under the null hypothesis
conditionally on any given set of N observed values, these randomization
tests are conditional tests, conditional on the N values observed.
In this chapter we will first discuss the two-sample observation-randomiza-
tion test based on the difference between the two sample means (or any
equivalent criterion), and the corresponding confidence procedure. Then
we will introduce the general class of two-sample randomization tests and
study most powerful randomization tests.

2 Randomization Procedures Based on the Difference
Between Sample Means and Equivalent Criteria

2.1 Tests

Given two sets of mutually independent observations, X_1, ..., X_m and
Y_1, ..., Y_n, suppose we wish to test the null hypothesis that their distribu-
tions are identical. Rejection of this hypothesis implies a difference, but not
any particular type of difference. If the shift model of Chap. 5 is assumed,
or if we are particularly interested in detecting a difference in location, this
situation can be considered a location problem. Then we have the null
hypothesis H_0: μ = 0, where μ is the shift parameter, and a corresponding
one-sided or two-sided alternative. The test we will study here is particularly
appropriate under the shift model.
A natural choice for a two-sample observation-randomization test
statistic is the difference between the two sample means Y - X. This is the
statistic underlying the procedures used for normal populations in parametric
statistics, and we would expect it to be particularly sensitive to differences
in location if the population shape is actually nearly normal. A randomiza-
tion test could be carried out by computing Y - X for each of the C(N, m)
separations of the actual observations into m X's and n Y's, which are equally
likely under the null hypothesis. We can then see how far out the particular
observed Y - X is the tail of the randomization distribution thus generated,
and carry out a test as usual.
Such calculations are lengthy. They can be simplified, unfortunately only
a little, by using some statistic more convenient than Y - X but equivalent
for the purpose. Some specific statistics which are equivalent since they are
monotonic functions of Y - X given the combined set of observations
298 6 Two-Sample Inferences Based on the Method of Randomization

(Problem 1) are the ordinary two-sample t statistic (less convenient!), the
sum Σ_i X_i of the X's, and S* = Σ_{r_i > m} X_i − Σ_{r'_j ≤ m} Y_j, where r_i is the rank of
X_i and r'_j is the rank of Y_j in the combined ordered sample.
The calculations are most easily performed using S* when Y - X is
"large," that is in the upper tail, and using a corresponding statistic defined
in Problem 1 when Y - X is in the lower tail. If the same value of S* occurs
more than once, each occurrence must be counted separately. (Theoretically
this has probability zero if continuous distributions are assumed, but in
practice it may occur.) One will ordinarily try to avoid enumerating all
C(N, m) possible separations. In order to find a P-value, only those C(N, m)·P separa-
tions which lead to values of S* equal to or more extreme (in the appropriate
direction) than that observed must be enumerated. If a nonrandomized
test is desired at level α, a decision to reject can be reached by enumerating
these same C(N, m)·P separations, with P < α, and a decision to "accept" requires
identifying any C(N, m)·α cases as extreme as that observed. For rejection, or a
P-value, every point in the relevant tail must be included; since it is difficult
to select the cases in the correct order, considerable care is required.
The procedure for generating the entire randomization distribution is
illustrated in Table 2.1. For m = 3, n = 4, there are C(7, 3) = 35 distinguishable
separations into X's and Y's. Each separation is listed in the table and the
value of S* calculated for each. (The ΣX column is included only to illus-
trate the evaluation of a different test statistic and to make the test more
intuitive; the Y - X column is included for use in Sect. 2.4.) The observed
value of S* is 1.5 and only three of the enumerated separations produce an
S* that small. Hence the one-tailed P-value by the randomization test is
3/35 = 0.0857. Since the randomization distribution is far from symmetric
here, different ways of relating the two tails would give appreciably different
two-tailed P-values. (See also Sect. 2.4 below, Problem 5, and Sect. 4.5,
Chap. 1.)
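
To make the enumeration concrete, here is a minimal sketch in Python (the code and names are ours, not part of the original exposition) that lists all C(7, 3) = 35 separations of the Table 2.1 data and recovers the one-tailed P-value 3/35 quoted above.

```python
from itertools import combinations

x = [-0.2, 0.9, 2.0]                        # X sample of Table 2.1
y = [0.5, 6.5, 11.5, 14.3]                  # Y sample of Table 2.1
m, n = len(x), len(y)
combined = x + y

observed = sum(y) / n - sum(x) / m          # observed Y-bar minus X-bar = 7.3
count = total = 0
for idx in combinations(range(m + n), m):   # a separation is determined by the X positions
    xs = [combined[i] for i in idx]
    ys = [combined[i] for i in range(m + n) if i not in idx]
    total += 1
    if sum(ys) / n - sum(xs) / m >= observed - 1e-12:   # at least as extreme, upper tail
        count += 1

print(count, total)                          # 3 35, so the one-tailed P-value is 3/35 = 0.0857
```

Enumerating index sets rather than value sets means that, if the same value occurs more than once, each occurrence is counted separately, as required above.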
If the null hypothesis is that the Y population is the same as the X
population except for a shift by a specified amount μ_0, then the foregoing
randomization test may be applied to X_1, ..., X_m, Y_1 − μ_0, ..., Y_n − μ_0. The
corresponding confidence procedure will be discussed in Sect. 2.3.

2.2 Weakening the Assumptions

In Sect. 2.1, the validity of the randomization distribution as the null dis-
tribution for the randomization test was based on the assumption that the
X and Y samples come from identical popUlations under the null hypothesis.
As noted in Sect. 1, the randomization distribution is also valid if N given
units are randomly separated into two groups, one of which is treated, and
the null hypothesis is that the treatment has no effect on any unit.
The probability ofrejection is ordinarily affected however, and sometimes
increased, if the samples are drawn from two populations which differ in
Table 2.1a
X sample: −0.2, 0.9, 2.0
Y sample: 0.5, 6.5, 11.5, 14.3
X̄ = 0.9   Ȳ = 8.2   m = 3   n = 4   S* = 1.5
Sample Separations
−0.2   0.5   0.9   2.0   6.5   11.5   14.3      ΣX     S*     Ȳ − X̄
X X X Y Y Y Y 1.2 0 8.16
X X Y X Y Y Y 2.3 1.1 7.53
X Y X X Y Y Y 2.7 1.5 7.30
Y X X X Y Y Y 3.4 2.2 6.89
X X Y Y X Y Y 6.8 5.6 4.91
X Y X Y X Y Y 7.2 6.0 4.68
Y X X Y X Y Y 7.9 6.7 4.27
X Y Y X X Y Y 8.3 7.1 4.03
Y X Y X X Y Y 9.0 7.8 3.62
Y Y X X X Y Y 9.4 8.2 3.39
X X Y Y Y X Y 11.8 10.6 1.99
X Y X Y Y X Y 12.2 11.0 1.76
Y X X Y Y X Y 12.9 11.7 1.35
X Y Y X Y X Y 13.3 12.1 1.12
Y X Y X Y X Y 14.0 12.8 0.71
Y Y X X Y X Y 14.4 13.2 0.48
X X Y Y Y Y X 14.6 13.4 0.36
X Y X Y Y Y X 15.0 13.8 0.12
Y X X Y Y Y X 15.7 14.5 -0.28
X Y Y X Y Y X 16.1 14.9 -0.52
Y X Y X Y Y X 16.8 15.6 -0.92
Y Y X X Y Y X 17.2 16.0 -1.16
X Y Y Y X X Y 17.8 16.6 -1.51
Y X Y Y X X Y 18.5 17.3 -1.92
Y Y X Y X X Y 18.9 17.7 -2.15
Y Y Y X X X Y 20.0 18.8 -2.79
X Y Y Y X Y X 20.6 19.4 -3.14
Y X Y Y X Y X 21.3 20.3 -3.55
Y Y X Y X Y X 21.7 20.5 -3.78
Y Y Y X X Y X 22.8 21.6 -4.42
X Y Y Y Y X X 25.6 24.4 -6.06
Y X Y Y Y X X 26.3 25.1 -6.47
Y Y X Y Y X X 26.7 25.5 -6.70
Y Y Y X Y X X 27.8 26.6 -7.34
Y Y Y Y X X X 32.3 31.1 -10.22

a These data are percent change in retail sales of Alabama drug stores from April 1971 to
April 1972. The X values are for three counties selected at random from those Alabama
counties classified as SMSA's (Standard Metropolitan Statistical Areas), and the Y values
are for randomly selected other counties. Small samples were selected so that generation of
the entire randomization distribution could be illustrated. A null hypothesis of practical
importance here is that the average percent change in retail sales during this period for
metropolitan areas in Alabama is not smaller than the corresponding change for less urban
areas.


any way, and in particular if the population variances differ. Nevertheless,


it is true that when the X's and Y's are all mutually independent with possibly
different distributions, the one-tailed randomization test that rejects when
Y - X is too large retains its level for the null hypothesis that every X_i is
"stochastically larger" than every Y_j. The same statements hold with large
and larger replaced by small and smaller.
Related remarks apply, of course, to the related confidence procedures.
For further discussion and detail, see Sects. 3.8 and 4.6 of Chap. 5 and Problem
17.

2.3 Related Confidence Procedures

Under the shift assumption, the randomization test procedure can also be
used to construct a confidence region for the amount of the shift μ. Un-
fortunately, the randomization distribution is different when different values
of μ are subtracted from the Y's. The confidence region is nevertheless an
interval, and its endpoints could be obtained by trial and error by subtracting
successive values of μ, larger or smaller than previous values as appropriate,
until the value of the test statistic equals the appropriate upper or lower
critical value of a randomization test at level α for shift 0. The endpoints of
the normal theory confidence interval for the difference of means at the
same level could be used as initial trial values of μ.
However, as in the corresponding one-sample problem, there is a more
systematic and convenient method which can be used to find the confidence
limits for j). exactly. Consider all pairs of equal-sized subsamples of X's
and Y's. Specifically, for the sample of m X's, consider all possible subsamples
of size r, and for the sample of nY's, consider all possible subsamples of the
same size r. Take all possible pairs of these equal-sized subsamples of X's
and Y's for all r, 1 :s; r :s; min(m, n). The total number of different pairs is

For each pair, take the difference of the subsample means, say the Y sub-
sample mean minus the X subsample mean. Consider the C(N, m) − 1 differences
in order of algebraic (not absolute) value. It can be shown that the kth
smallest and kth largest of these differences of equal-sized subsample means
are the lower and upper confidence bounds respectively, each at level 1 − α,
that correspond to the two-sample randomization test based on Y - X
(or any other equivalent test criterion), where α = k/C(N, m) and hence k = C(N, m)·α
(Problem 2).
To save labor, instead of using all C(N, m) − 1 differences one could use a
smaller number, say M, if they are selected either at random without re-
placement (Problem 3) or in a "balanced" manner determined by group
theory; then α = k/(M + 1). Both methods are discussed briefly in Sect. 2.5;

more detail for a single sample from a symmetric population was given in
Sect. 2.5 of Chap. 4.
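
A minimal Python sketch of the systematic method just described (the function and names are ours, not part of the original exposition): it forms every difference of equal-sized subsample means and returns the kth smallest and kth largest as the lower and upper confidence bounds, each at level 1 − α with α = k/C(N, m).

```python
from itertools import combinations
from math import comb

def shift_confidence_bounds(x, y, k):
    m, n = len(x), len(y)
    diffs = []
    for r in range(1, min(m, n) + 1):                   # all pairs of equal-sized subsamples
        for xs in combinations(x, r):
            for ys in combinations(y, r):
                diffs.append(sum(ys) / r - sum(xs) / r)
    diffs.sort()                                        # algebraic (not absolute) order
    assert len(diffs) == comb(m + n, m) - 1             # the count given above
    return diffs[k - 1], diffs[len(diffs) - k]          # lower and upper bounds for the shift
```

For the Table 2.1 data, shift_confidence_bounds(x, y, 1) gives the bounds corresponding to α = 1/35 in each tail.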

2.4 Properties of the Exact Randomization Distribution

Unfortunately, the randomization distribution of Y - X (or any equivalent


randomization test statistic) is symmetric in general only for m = n; for
m ≠ n, the two tails are not mirror images except by a remarkable coincidence
(Problem 5). This lack of symmetry makes generation of the null distribu-
tion more tedious and approximation based on moments more difficult.
Furthermore, it has the methodological consequence that there is more than
one possible definition of the two-tailed procedure. For example, the ran-
domization test based on the absolute value of the difference of means
|Y - X| is not equal-tailed; a test which rejects if |Y - X| is in the upper α
tail of its randomization distribution (using the absolute values of the dif-
ferences) is not the same as a test which rejects if Y - X is in the upper or
lower α/2 tail of its randomization distribution (Problem 6). Presumably,
however, the discrepancy is ordinarily small. In the example of Table 2.1,
the upper-tailed P-value for |Y - X| = 7.30 is 5/35, as compared to 3/35
for Y - X = 7.30.
We will now derive the exact mean and variance of the randomization
distribution of X and of Y - X. For this purpose it is convenient to write
the latter statistic as a function of only one of the sample means, say X. If
we denote the m + n = N observations in the combined sample by ai' a2,
... , aN, we can write

- - ,,\,N
L..I aj - "\''''X
L..I j -
Y-X= -X
n

(2.1)

For the randomization distribution we can interpret X̄ as the mean of a sample
of m drawn without replacement from the population consisting of a_1, ...,
a_N. The mean and variance of this population are

    ā = Σ_{j=1}^{N} a_j/N = (Σ_{i=1}^{m} X_i + Σ_{j=1}^{n} Y_j)/N

and

    Σ_{j=1}^{N} (a_j − ā)²/N = S_2/N
where
    S_r = Σ_{j=1}^{N} (a_j − ā)^r = Σ_{i=1}^{m} (X_i − ā)^r + Σ_{j=1}^{n} (Y_j − ā)^r.        (2.2)

Therefore the randomization distribution of X̄ has mean

    E(X̄) = ā        (2.3)

and variance (including the correction factor for finite populations)

    var(X̄) = [(N − m)/(N − 1)](S_2/N)/m = nS_2/[mN(N − 1)].        (2.4)

Furthermore, by (2.1), (2.3) and (2.4) the mean and variance σ² of the ran-
domization distribution of Ȳ − X̄ are given by

    E(Ȳ − X̄) = 0        (2.5)

    var(Ȳ − X̄) = σ² = NS_2/[mn(N − 1)].        (2.6)
The corresponding moments of several other equivalent test criteria are
easily derived (Problem 4).
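
Formulas (2.5) and (2.6) are easy to check numerically; the following Python sketch (ours, not part of the original exposition) compares them with the moments of the fully enumerated randomization distribution of Y - X for the Table 2.1 data, and they agree up to rounding error.

```python
from itertools import combinations

a = [-0.2, 0.5, 0.9, 2.0, 6.5, 11.5, 14.3]      # combined sample of Table 2.1
m, n = 3, 4
N = m + n
a_bar = sum(a) / N
S2 = sum((v - a_bar) ** 2 for v in a)
sigma2 = N * S2 / (m * n * (N - 1))              # variance from (2.6)

diffs = []
for idx in combinations(range(N), m):
    xs = [a[i] for i in idx]
    ys = [a[i] for i in range(N) if i not in idx]
    diffs.append(sum(ys) / n - sum(xs) / m)

mean_enum = sum(diffs) / len(diffs)              # essentially 0, as in (2.5)
var_enum = sum(d * d for d in diffs) / len(diffs)
print(mean_enum, var_enum, sigma2)               # var_enum matches sigma2
```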

2.5 Approximations to the Exact Randomization Distribution

If a randomization test is desired but cannot be carried out exactly, whether


by direct enumeration of the randomization distribution or otherwise, one
can use (a) simulation, (b) a restricted randomization set, or (c) the normal
or some other approximation. These procedures are described briefly here;
see Sect. 2.5 of Chap. 4 for more details.
(a) Simulation. Each sample observation is associated with a number from
1 to N and some method is used to draw m different numbers between 1
and N at random. The observations associated with the m numbers generated
are then labeled X, the rest are labeled Y, and the test criterion is calculated.
The whole process is repeated many times and the relative frequency dis-
tribution obtained for the test criterion is the simulated randomization
distribution. This provides an estimate of the P-value of the randomization
test, or the exact P-value of a redefined test, as explained in Sect. 2.5 of
Chap. 4.
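
A minimal Python sketch of this simulation (ours, not part of the original exposition): random separations are drawn by shuffling, and the proportion of simulated statistics at least as extreme as the observed one estimates the one-tailed P-value.

```python
import random

def simulated_p_value(x, y, reps=10000, seed=1):
    """Estimate the upper-tailed randomization P-value of Y-bar minus X-bar."""
    rng = random.Random(seed)
    m, n = len(x), len(y)
    observed = sum(y) / n - sum(x) / m
    pool = x + y
    hits = 0
    for _ in range(reps):
        rng.shuffle(pool)                              # one random separation
        if sum(pool[m:]) / n - sum(pool[:m]) / m >= observed - 1e-12:
            hits += 1
    # (hits + 1) / (reps + 1) is one common way to redefine the test so the
    # reported value is an exact P-value; see Sect. 2.5 of Chap. 4.
    return hits / reps
```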
(b) Restricted Randomization Set. Instead of generating separations into
X's and Y's randomly, one can generate them systematically by means of a
group of transformations. Conceptually, the simplest method would be to
consider a subgroup G of the group of all permutations of the integers
1, ... , N, apply each permutation in G to the given observations arranged
in the order X I' ... , X m • Y1 , ••• , Y", take the first m of the permuted values
as X's and the remaining n as Y's, and calculate the test statistic. The propor-
tion of the calculated values of the test statistic which are at least as extreme
as the observed value is the P-value of the test based on this restricted
randomization set. The corresponding confidence bound for a shift parameter
J.l can be obtained as follows. Assume that the identity is the only permutation
in G that does not change the separation. (Nothing is gained by considering
other permutations that do not change the separation.) For each permuta-
2 Randomization-DIfference Between Sample Means and EquIvalent Critena 303

tion in G except the identity, take the mean of all Y's that are permuted to
become X's and subtract the mean of all X's that are permuted to become
Y's. The kth smaIlest (or kth largest) of these differences of subsample
means is the lower (or upper) confidence bound corresponding to the test
having k points in the critical region, and the one-tailed α is k divided by the
size of the subgroup.
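
As one concrete (and not unique) choice of subgroup, the Python sketch below (ours, not part of the original exposition) uses the N cyclic rotations of the positions 1, ..., N as G, so the attainable one-tailed levels are multiples of 1/N; it assumes no nontrivial rotation leaves the X/Y separation unchanged.

```python
def rotation_randomization_p_value(x, y):
    """Upper-tailed P-value of Y-bar minus X-bar over the cyclic-rotation subgroup."""
    a = x + y                                  # observations in the order X_1..X_m, Y_1..Y_n
    m, n, N = len(x), len(y), len(x) + len(y)
    observed = sum(y) / n - sum(x) / m
    hits = 0
    for shift in range(N):                     # each rotation is one permutation in G
        b = a[shift:] + a[:shift]
        if sum(b[m:]) / n - sum(b[:m]) / m >= observed - 1e-12:
            hits += 1
    return hits / N                            # the identity (shift = 0) is always counted
```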
(c) Approximations. As in the one-sample case in Chap 4, several approxima-
tions that are based on tabled probability distributions are possible. Four
will be given here, but none of them reflects the asymmetry ofthe randomiza-
tion distribution of Y - X.
A natural approximation is the standard normal distribution for the
standardized randomization test statistic (Y - X)/σ where σ is given by
(2.6). The foIlowing reasoning suggests, however, that a better approximation
may be obtained by calculating the 'ordinary two-sample t statistic and
treating it as Student's t distributed with N - 2 degrees of freedom.
The ordinary two-sample t statistic for equality of means assuming equal
variances can be written here as (Problem 10)

    t = [(Ȳ − X̄)/σ] [(N − 2)/(N − 1 − (Ȳ − X̄)²/σ²)]^{1/2}.        (2.7)

Comparing (Y - X)/σ with a percent point z_α of the standard normal dis-
tribution is then equivalent to comparing t with the constant

    z_α [(N − 2)/(N − 1 − z_α²)]^{1/2},        (2.8)

rather than with the corresponding percent point t_α of Student's t distribu-
tion with N − 2 degrees of freedom. (The quantity in (2.8) and t_α differ
appreciably in small samples except for α near 0.05, although both approach
z_α as N → ∞. See Problem 15 of Chap. 4.) If the randomization test is to be per-
formed by comparing the t statistic to a constant value, then t_α will be a
better constant on the average than (2.8) under normality. This suggests
that it will be better in general, at least for combined samples which are
not highly non-normal. Moses [1952] suggests the rule of thumb that this
approximation can be used when i s min s 4 and the kurtosis for the pooled
sample is close to 3.
*By matching the moments of t²/(t² + N − 2) under the randomization
distribution with the corresponding moments of the beta distribution, we
obtain another approximation. As derived below, it is equivalent to treating
t² as F distributed with fractional degrees of freedom d and (N − 2)d,
where

    d = 2(1 − D)/[(N − 1)D − 1]        (2.9)

    D = (N/mn) [3(m − 1)(n − 1)/((N − 2)(N − 3)) · (1 − 2S_4/S_2²) + S_4/S_2²]        (2.10)

and S_r is given in (2.2). Recall, however, that an upper-tailed F probability
corresponds to a two-tailed t probability.
An idea of the size of the correction factor d is obtained by rewriting (2.9)
in the form (Problem 11)

    d = 1 + [(N + 1)/(N − 1)] C_2 [2mn(N − 2)/(6mn − N² − N) − C_2]^{−1}        (2.11)

where

    C_2 = [N(N + 1)S_4/S_2² − 3(N − 1)] / [(N − 2)(N − 3)/(N − 1)].        (2.12)

Under the null hypothesis, C2 is a consistent estimator (Problem 12) of the


population kurtosis minus 3, which is 0 for normal distributions.
A further approximation, which avoids the small fractional number of
degrees of freedom d in the numerator of F, is to treat c times the t statistic
as Student's t distributed with k degrees of freedom where k and c are given
in (2.11) and (2.12) of Chap. 4 respectively with n replaced by N - 1. This
step is exactly the same as in Sect. 2.4 of Chap. 4 (Problem 13).*
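
The approximations above are easy to program. The Python sketch below is ours (not part of the original exposition) and assumes SciPy is available for the tail areas; it computes the standardized statistic for the normal approximation, the t statistic of (2.7) for the Student's t approximation, and the fractional degrees of freedom d of (2.9)-(2.10) for the F approximation.

```python
from scipy import stats            # assumed available for normal, t, and F tail areas

def approximate_tail_areas(x, y):
    m, n = len(x), len(y)
    N = m + n
    a = x + y
    a_bar = sum(a) / N
    S2 = sum((v - a_bar) ** 2 for v in a)
    S4 = sum((v - a_bar) ** 4 for v in a)
    sigma2 = N * S2 / (m * n * (N - 1))                                   # (2.6)
    z = (sum(y) / n - sum(x) / m) / sigma2 ** 0.5
    t = z * ((N - 2) / (N - 1 - z * z)) ** 0.5                            # (2.7)
    D = (N / (m * n)) * (3 * (m - 1) * (n - 1) / ((N - 2) * (N - 3))
                         * (1 - 2 * S4 / S2 ** 2) + S4 / S2 ** 2)         # (2.10)
    d = 2 * (1 - D) / ((N - 1) * D - 1)                                   # (2.9)
    return {
        "normal, one-tailed": stats.norm.sf(z),
        "Student t, one-tailed": stats.t.sf(t, df=N - 2),
        "F with fractional df, two-tailed": stats.f.sf(t * t, d, (N - 2) * d),
    }
```

The F entry is two-tailed, in accordance with the remark above that an upper-tailed F probability corresponds to a two-tailed t probability.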

*PROOF. Under normal theory, t² has an F distribution with 1 and N − 2
degrees of freedom, or equivalently t²/(t² + N − 2) has a beta distribution
with parameters 1/2 and (N − 2)/2. We also have, by (2.7),

    t²/(t² + N − 2) = (Ȳ − X̄)²/[(N − 1)σ²].        (2.13)

This suggests approximating the randomization distribution of (Ȳ − X̄)²/
(N − 1)σ² by a beta distribution with the same first two moments. This
randomization distribution has first two moments (Problem 14a)

    E[(Ȳ − X̄)²/(N − 1)σ²] = 1/(N − 1)        (2.14)

    E[(Ȳ − X̄)⁴/(N − 1)²σ⁴] = N[3(mn − N + 1) + (N² + N − 6mn)S_4/S_2²] / [mn(N − 1)(N − 2)(N − 3)]
                            = D/(N − 1)        (2.15)

where D is given in (2.10). Equating the moments in (2.14) and (2.15) to the
corresponding moments of the beta distribution with parameters a and b
as given in (2.16) of Chap. 4 gives the relations (Problem 14b)

    a = (1 − D)/[(N − 1)D − 1],    b = (N − 2)a.        (2.16)

Thus we are led to approximate the randomization distribution of (Ȳ − X̄)²/
(N − 1)σ² by a beta distribution with the parameters in (2.16). This is
equivalent (Problem 14c) to approximating the randomization distribution of
t² by an F distribution with degrees of freedom 2a = d and 2b = (N − 2)d,
as stated.*  □

*Matching the first two moments of the randomization distribution of


(Ȳ − X̄)²/(N − 1)σ² with those of the beta distribution is like (but not
equivalent to) matching the first two moments of the randomization
distribution of t². The latter is like matching the second and fourth moments
of the randomization distribution of t. Treating the randomization distribu-
tion of t as symmetric matches also its first moment, but not ordinarily its
third (Problem 15). Hence the approximation at (2.9) and (2.10) has done
something like matching the first, second and fourth moments of t under the
randomization distribution.
We have not discussed the asymptotic normality of the randomization
distribution of (Y - X)/σ. It is complicated because we really want to
know whether the conditional distribution of (Y - X)/σ, given the N
observations actually obtained (but not which are X's and which are Y's),
"approaches" the standard normal distribution in some sense, not merely
whether the marginal distribution of (Y - X)/σ is asymptotically standard
normal. A similar comment applies to observation-randomization tests
generally, but not to rank-randomization tests. See also the last paragraph of
Sect. 2.5 of Chap. 4. *

3 The Class of Two-Sample Randomization Tests

3.1 Definition

The level of the randomization test based on the difference between the
means of two independent samples, or on any other equivalent test criterion,
relies only on the assumption that, given the observations, say a_1, ..., a_N,
all possible separations into X_1, ..., X_m, Y_1, ..., Y_n are equally likely, as
they are under the null hypothesis of identical populations, or the null
hypothesis that the treatment has no effect when the treatment group is
selected randomly from the whole set. Thus it is a conditional test, condi-
tional on a_1, ..., a_N. In particular, it has level α if, under the null hypothesis,
the conditional probability of rejection given a_1, ..., a_N is at most α, and its
P-value is the corresponding conditional probability of a value of the test
statistic equal to or more extreme than that observed.
More generally, as stated in Sect. 1, a two-sample randomization test is a
test which is conditional on the observations a" ... , aN, its null distribution

being the randomization distribution generated by randomly separating


the combined observations into two groups of sizes m and n. Such tests may
depend on the actual values of aI' ... , aN' and may be called observation-
randomization tests to emphasize this possibility. Rank-randomization tests
are those which use only functions of the ranks of the two samples within the
combined set of N. All of the tests presented in Chap. 5 were rank-randomiza-
tion tests. The class of all randomization tests is broader than the examples
given so far might suggest; it includes analogues of the one-sample examples
given in Sect. 3.1 of Chap. 4 (Problem 19).
The procedures we have considered have all been permutation-invariant,
that is, not affected by rearrangements of the observations within the X
sample or within the Y sample (Sect. 7 of Chap. 5). For such procedures,
two arrangements of a 1> ••• , aN into X I' ... , X m' YI , ... , y" may as well be
regarded as identical if they are the same except for the order of the X's or
the order of the Y's or both, and we need distinguish only (~) distinct possi-
bilities, which we have previously called "separations." Furthermore, in
some situations the indices (labels) within samples lack reality or meaning
and it is then impossible or meaningless to distinguish more possibilities,
which would require distinguishing order within samples. (Sometimes, for
instance, only the order statistics of each sample are available.) There are
situations, however, in which it is possible and even sensible to take into
account the order within one or both samples, and there are procedures that
do so (Problem 20). It then becomes necessary to distinguish two types of
randomization distribution, according as permutations within samples are
or are not included in the randomization set.
An N!-type randomization distribution is one which includes permuta-
tions—the equally likely possibilities are all N! arrangements of the N given
values a_1, ..., a_N into X_1, ..., X_m, Y_1, ..., Y_n; reorderings within samples
are allowed and counted separately. A C(N, m)-type randomization distribution
excludes permutations—the equally likely possibilities are only the C(N, m)
separations of the N given values a_1, ..., a_N into an X sample and a Y sample
without reordering; that is, a set of m a's is drawn without replacement and,
with the a's in their original order, X_i is the value of the ith a in the set
drawn and Y_j is the value of the jth a not in this set. A C(N, m)-type ran-
domization test is necessarily an N!-type randomization test (since it is more
conditional), but the converse is not true (see below). To conclude from some
given conditions that a test must be a C(N, m)-type randomization test is stronger
than to conclude that it must be an N! type. Whether all randomization
tests of the N! type are valid, or only those of the C(N, m) type (or not even all of
those), depends on the null hypothesis (see Problem 21). For a permutation-
invariant procedure, the two types of randomization distribution are
equivalent but the C(N, m) type is simpler; the N! type merely counts repeatedly,
m! n! times altogether, each possibility counted by the C(N, m) type. (Related
points arose in Sects. 7 and 8.1 of Chap. 5, and the corresponding aspects of
the one-sample problem were discussed in Sect. 3.2 of Chap. 4.)

3.2 Properties

We now turn to some of the theoretical properties of members of the class of


two-sample randomization tests.

Critical Function and Level

Denote the critical function, that is, the probability of rejection, of an ar-
bitrary test by φ(X_1, ..., X_m; Y_1, ..., Y_n). Consider first a null hypothesis
H_0 under which, given the observations a_1, ..., a_N, all N! arrangements of
them into two samples X_1, ..., X_m; Y_1, ..., Y_n are equally likely. Then under
H_0, the conditional expected value given the a_j of φ(X_1, ..., X_m; Y_1, ..., Y_n)
is simply the mean of its N!-type randomization distribution, or

    (1/N!) Σ φ(a_{π_1}, ..., a_{π_m}; a_{π_{m+1}}, ..., a_{π_N}),        (3.1)

where the sum is over the N! permutations π_1, ..., π_N of the integers 1, ..., N.
Alternatively, consider a null hypothesis H_0 under which, given the observa-
tions a_1, ..., a_N, all C(N, m) separations into an X sample of size m and a Y
sample of size n are equally likely. Then under H_0, the conditional expected
value given the a_j of φ(X_1, ..., X_m; Y_1, ..., Y_n) is simply the mean of the
C(N, m)-type randomization distribution, or

    [1/C(N, m)] Σ φ(a_{π_1}, ..., a_{π_m}; a_{π_{m+1}}, ..., a_{π_N}),        (3.2)

where the sum is over the C(N, m) separations of the integers into two sets
{π_1, ..., π_m} and {π_{m+1}, ..., π_N} with π_1 < ... < π_m and π_{m+1} < ... < π_N.
The expected value (3.1) or (3.2), whichever applies, is the conditional
probability of a Type I error for the test. Accordingly, a test φ has conditional
level α, given the a_j, if the quantity (3.1) or (3.2) is less than or equal to α.
If this holds for all a_1, ..., a_N, then the test is a randomization test, of the N!
type or the C(N, m) type respectively. Any such test also has level α unconditionally
by the usual argument. We shall see that, conversely, a test having uncondi-
tional level α must, under certain circumstances, have conditional level α
given the observations a_1, ..., a_N, that is, must be an N!-type or C(N, m)-type ran-
domization test at level α.
The statements in Sect. 2.2 about weakening the assumptions apply here
also as long as, for those statements referring to stochastic dominance, the
critical function is suitably monotonic in the X_i and Y_j.

Justifications of Randomization Tests

Section 7 of Chap. 5 presented an argument based on concepts of invariance


to justify restricting consideration to rank tests, which are a particular case of
randomization tests. This argument required, however, the very strong
assumption of invariance under all strictly increasing transformations of the
observations. We shall now discuss quite different assumptions that lead to
randomization tests (not necessarily rank tests).

Observations Not Identically Distributed Under the Null Hypothesis. We have


already noted that, under suitable hypotheses, a randomization test has
conditional level α given the observations and hence unconditional level α.
It can also be shown that, if the null hypothesis is sufficiently broad, then
every test at level α is a randomization test. Such a strong result is not
available for independent, identically distributed observations, as will be
discussed shortly. If identical distributions are not required, however, and
the null hypothesis is sufficiently broad, all level α tests are C(N, m)-type ran-
domization tests. This holds under either of the following null hypotheses:

H_0:  The observations Z_1, ..., Z_N are independent with arbitrary dis-
      tributions, and X_1, ..., X_m; Y_1, ..., Y_n are a random separation of
      Z_1, ..., Z_N into an X sample and a Y sample.
H'_0: H_0 holds and the X_1, ..., Y_n have densities.

The same conclusion also holds under less broad hypotheses, including

H''_0: H_0 holds and the Z_j are normally distributed with arbitrary means μ_j
       and common variance σ².

Note that H_0, for instance, does not say that X_1, ..., X_m; Y_1, ..., Y_n are in-
dependently distributed with arbitrary distributions. This would place no
restriction whatever on the relation between the X's and the Y's and hence
could not serve usefully as a null hypothesis. By a random separation we
mean that, given Z_1, ..., Z_N, the X's are a random sample of the Z's without
replacement but in their original order, while the Y's are the remaining Z's
in their original order.
We have in mind the kind of situation in which, for example, there are N
available experimental units, of which m are to be chosen at random to
receive a treatment and the rest to serve as controls. If the null hypothesis
is that the treatment has no effect whatever on any unit, then the random
selection of the units to be treated guarantees the validity of the level of any
randomization test. Are there any other tests at a specified level C(? Suppose
that, if all experimental units were untreated (or if the treatment had no
effect whatever), one would be willing to assume no more than Ho, that is,
that the N observations on the N experimental units are independently dis-
tributed with arbitrary distributions, possibly all different. The fact stated

above is that the only tests having level ex under such a weak assumption are
the randomization tests at level rx. Furthermore, the null hypothesis can be
made considerably more restrictive without upsetting this conclusion. For
example, the conclusion holds for a normal model with common variance
but arbitrary unit effects, as in H'O. It therefore holds for any null hypothesis
which permits this normal model (Problem 24b). It also holds if the unit
effects are arbitrary constants (Problem 24a).
These properties are summarized in Theorem 3.1.

Theorem 3.1. If a test has level α for H_0, H'_0, or H''_0, then it is a C(N, m)-type ran-
domization test. Conversely, a randomization test has level α for H_0 and hence
for H'_0 and H''_0.

The proof of Theorem 3.1 for H_0 is requested in Problem 24c. The proofs
for H'_0 and H''_0 involve measure-theoretic considerations and will not be
considered here. See Lehmann [1959, Sect. 5.10] or Lehmann and Stein
[1949].

Observations Identically Distributed Under the Null Hypothesis. In some


cases it may be reasonable to assume that all the X's and Y's are independent
and identically distributed under the null hypothesis. Consider, for instance,
a situation in which a random sample of m units X I' ... , Xm is drawn from
one population and a random sample of n units Y I , ... , y" is drawn from
another. If in fact the populations are identical, then all N X's and Y's are
simply a random sample from a single population; that is, they are indepen-
dent and identically distributed. An example of this occurs if m units are
drawn at random from a large population and treated in some way and n
units are independently drawn at random from the same population to
serve as controls. If the treatment has no effect on the population distribu-
tion of the characteristic measured, even though it may affect the individual
units, then X I' ... , y" are again independent and identically distributed.
This imposes conditions on the null hypothesis that were not imposed
earlier. Nevertheless, we can prove that a test having level rx for a sufficiently
broad null hypothesis of this kind must be an N !-type randomization test if
we also impose the additional condition on the test that it be unbiased
against a sufficiently broad class of alternatives. Specifically, suppose we wish
to test one of the following null hypotheses:

H_0:  The variables X_1, ..., Y_n are independently, identically distributed.

H'_0: The variables X_1, ..., Y_n are independently, identically distributed
      and their common distribution has a density.

Suppose also that the test cp, at level rx, is unbiased against an alternative
hypothesis which includes distributions arbitrarily close to each distribution
of the null hypothesis. Then, by the usual argument (Sect. 6.3, Chap 2), cp

must have level exactly α under every null distribution included in the null
hypothesis. This in turn implies that φ has conditional level α given the
combined sample observations but not their assignment to X_1, ..., X_m;
Y_1, ..., Y_n. (The proof, Problem 25b, is like that in Sect. 3.2 of Chap. 4.)
Since the conditional null distribution is the N!-type randomization
distribution, it follows that if φ is an unbiased test of H_0 or H'_0 against a
sufficiently broad alternative, then it is an N!-type randomization test.
Alternatives which are sufficiently broad are (Problem 25a)

H_1:  X_1, ..., X_m; Y_1, ..., Y_n are drawn independently from two popula-
      tions which are the same except for a shift.
H'_1: H_1 holds and the populations have densities.

These properties are summarized in Theorem 3.2.

Theorem 3.2. If a test has level α for H_0 or H'_0 and is unbiased against one-
sided or two-sided shifts, then it is an N!-type randomization test.

The result in Theorem 3.2 "justifies" restricting consideration to N!-type
randomization tests when the X's and Y's are independent and identically
distributed within samples. Under these circumstances, the reasons given in
Sect. 7 of Chap. 5 for using a permutation-invariant procedure also apply.
As mentioned earlier, for permutation-invariant tests, the distinction be-
tween N!-type and C(N, m)-type randomization is of no consequence. Thus we
have "justified" restriction to permutation-invariant randomization tests.
Of course, the weaknesses in the justification are the requirements for level
strictly α or less under a very broad null hypothesis and power strictly α or
more for a very wide class of alternative distributions.

4 Most Powerful Randomization Tests

4.1 General Case

Reasons for using randomization tests were given in Sect. 3.2. In this
subsection we will see how to find that randomization test which is most
powerful against any specific alternative distribution. The particular case
of normal shift alternatives will be illustrated in the two subsections following.
Suppose, for definiteness, that we are considering all of the N!-type
randomization tests. Then under the randomization distribution, given the
observations a_1, ..., a_N, all N! possible arrangements into X_1, ..., X_m,
Y_1, ..., Y_n are equally likely. A randomization test is valid as long as this
condition is satisfied under the null hypothesis. Consider now an alternative

with joint density or discrete frequency function f(x_1, ..., x_m, y_1, ..., y_n).
Under this alternative, given a_1, ..., a_N, the conditional probabilities of
each of the N! possible arrangements X_1, ..., X_m, Y_1, ..., Y_n are proportional
to f(X_1, ..., X_m, Y_1, ..., Y_n). By the Neyman-Pearson Lemma (Theorem
7.1 of Chap. 1), it follows (Problem 26) that among randomization tests, the
conditional power against f is maximized by a test of the form

    reject if f(X_1, ..., X_m, Y_1, ..., Y_n) > k
    "accept" if f(X_1, ..., X_m, Y_1, ..., Y_n) < k.        (4.1)

Randomization may be necessary at k. The choice of k and the randomization
at k must be determined so that the test has conditional level exactly α.
That is, we consider the value of f at each arrangement X_1, ..., X_m, Y_1, ..., Y_n
which arises by rearrangement of the observed X_1, ..., X_m, Y_1, ..., Y_n.
The arrangements X_1, ..., X_m, Y_1, ..., Y_n are placed in the rejection region in
decreasing order of their corresponding values of f, starting with the largest
value of f, until their null probabilities total α. The region will consist of the
αN! possible arrangements which produce the largest values of f if αN!
is an integer and if the αN!th and (αN! + 1)th values of f are not tied. Ties
may be broken arbitrarily. If αN! is not an integer, a randomized test will be
necessary.
Since this test is the randomization test which maximizes the conditional
power against f, it also maximizes the ordinary (unconditional) power
against f. (Problem 28 of Chap. 4 covers this for the one-sample case.) Thus
we have shown how to find the most powerful N!-type randomization test
against any specified alternative f.
The conditions given are necessary and sufficient. The method for C(N, m)-
type randomization tests is similar. Of course, if a most powerful N!-type
randomization test is a C(N, m)-type randomization test, then it is also most
powerful in the smaller class of C(N, m)-type randomization tests.
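
A minimal Python sketch (ours, not part of the original exposition) of the construction just described, for the C(N, m)-type class and a permutation-invariant alternative density f: separations are ordered by f, and the rejection probability for the observed separation is 1, 0, or a randomizing fraction at the boundary so that the conditional level is exactly α.

```python
from itertools import combinations

def most_powerful_randomization_test(x, y, f, alpha):
    """Return the conditional probability of rejecting the observed separation.

    f(xs, ys) is the alternative density evaluated at a separation; this sketch
    compares values of f with ==, which is adequate for illustration only.
    """
    a = list(x) + list(y)
    N, m = len(a), len(x)
    values = []
    for idx in combinations(range(N), m):
        xs = [a[i] for i in idx]
        ys = [a[i] for i in range(N) if i not in idx]
        values.append(f(xs, ys))
    observed = f(list(x), list(y))
    target = alpha * len(values)                       # alpha * C(N, m) separations to reject
    higher = sum(v > observed for v in values)
    ties = sum(v == observed for v in values)          # includes the observed separation
    if higher >= target:
        return 0.0                                     # observed separation is "accepted"
    return min(1.0, (target - higher) / ties)          # randomize at the boundary value k
```

With a normal-shift density f as in the next subsection, this reduces to the one-tailed test based on Y - X.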

4.2 One-Sided Normal Alternatives

Consider now the alternative that the X's are normal with mean μ_1 and
variance σ², the Y's are normal with mean μ_2 and the same variance σ², and
all are independent. It follows from (4.1) (Problem 27) that the upper-
tailed randomization test based on Y - X (or an equivalent statistic) is the
most powerful randomization test (of either the N!-type or the C(N, m)-type)
against any such alternative with μ_2 > μ_1; that is, it is the uniformly most
powerful randomization test against μ_2 > μ_1. Similarly, the lower-tailed
randomization test based on Y - X is the uniformly most powerful ran-
domization test against μ_2 < μ_1. Note that the one-tailed tests here do not
depend on the values of the parameters, in contrast to the most powerful
rank tests of Sect. 8.1 of Chap. 5.

4.3 Two-Sided Normal Alternatives

Suppose that the X's and Y's are normal with common variance, as above,
and consider the alternative μ_1 ≠ μ_2. There is no uniformly most powerful
randomization test against this alternative, different randomization tests
being most powerful against μ_1 < μ_2 and μ_1 > μ_2. It is apparently unknown
whether there is a uniformly most powerful unbiased randomization test.
We shall prove, however, that the randomization test rejecting for large
|Y - X| has two other properties. This test may be thought of as a two-
tailed randomization test based on Y - X, but as mentioned earlier, unless
m = n, it is not ordinarily the equal-tailed randomization test based on
Y - X because the randomization distribution of Y - X is not ordinarily
symmetric unless m = n (Problem 5).
One property of the randomization test rejecting for large |Y - X| is
that it is uniformly most powerful against μ_1 ≠ μ_2 among randomization
tests which are invariant under transformations carrying X_1, ..., X_m,
Y_1, ..., Y_n into c − X_1, ..., c − X_m, c − Y_1, ..., c − Y_n, where c is an arbitrary
constant. Notice that such a transformation carries the alternative given by
(μ_1, μ_2, σ) into that given by (c − μ_1, c − μ_2, σ), so the invariance rationale
(Sect. 8, Chap. 3) can be applied. In particular, any invariant test has the
same power against all alternatives with the same μ_1 − μ_2 and the same σ
(Problem 28). The statement is that no randomization test which is invariant
under all such transformations is more powerful against even one alternative
with μ_1 ≠ μ_2 than the randomization test rejecting when |Y - X| is too
large.
*This randomization test is also the "most stringent" randomization test
in the situation under discussion. This property is defined as follows. Let
α*(μ_1, μ_2, σ) be the power of the most powerful randomization test against
the alternative given by (μ_1, μ_2, σ). Then the power α(μ_1, μ_2, σ) of any other
randomization test φ is at most α*(μ_1, μ_2, σ), and the difference measures
how far short of optimum the test φ is. Accordingly, we define

    Δ = max [α*(μ_1, μ_2, σ) − α(μ_1, μ_2, σ)]

as the maximum amount by which φ falls short of optimum. The randomiza-
tion test which rejects for large |Y - X| has the property that it minimizes
Δ among randomization tests and hence is most stringent. Specifically, the
maximum amount by which the power of φ is less than that of the best
randomization test against each alternative separately is minimized among
randomization tests by the randomization test rejecting for large |Y - X|.
For no ε > 0 is there a randomization test which comes within Δ − ε of
the optimum at every alternative, and the randomization test rejecting for
large |Y - X| comes within Δ everywhere. Notice that this property does
not in itself exclude the possibility that another randomization test is much
better against most alternatives but slightly worse against some (Problem
31).*

*PROOF. The same basic device can be used to obtain both of the above
properties. (This is not surprising in light of the fact that a uniformly most
powerful invariant test is most stringent under quite general conditions
[Lehmann, 1959, p. 340].)
Consider the alternatives given by (μ_1, μ_2, σ) and (c − μ_1, c − μ_2, σ).
The average power of any test against these two alternatives is the same as its
power against

    h(x_1, ..., x_m, y_1, ..., y_n) = [Π_{i=1}^{m} g(x_i; μ_1, σ) Π_{j=1}^{n} g(y_j; μ_2, σ)
          + Π_{i=1}^{m} g(x_i; c − μ_1, σ) Π_{j=1}^{n} g(y_j; c − μ_2, σ)]/2        (4.2)

where g(z; μ, σ) is the normal density with mean μ and variance σ² (Problem
32a). We shall show that, for any (μ_1, μ_2, σ), there is a c such that the random-
ization test rejecting for large |Y - X| is the most powerful randomization
test against h. From this and some further arguments, the desired conclusions
follow (Problem 32c).
By straightforward calculation, the first term on the right-hand side of
(4.2) is a multiple of

    exp{−(1/2σ²)[Σ_i (x_i − μ̄)² + Σ_j (y_j − μ̄)²
          + (2mn/N)(μ_2 − μ_1)(x̄ − ȳ) + (mn/N)(μ_2 − μ_1)²]}        (4.3)

where x̄ = Σ_i x_i/m, ȳ = Σ_j y_j/n, and

    μ̄ = (mμ_1 + nμ_2)/N.        (4.4)

The second term on the right-hand side of (4.2) is the same as (4.3) but with
c − μ_1 in place of μ_1 and c − μ_2 in place of μ_2, and therefore with

    c − μ̄ = [m(c − μ_1) + n(c − μ_2)]/N        (4.5)

in place of μ̄. If c = 2μ̄, so that the two quantities in (4.4) and (4.5) are equal,
then the density in (4.2) becomes

    (2πσ²)^{−N/2} exp{−(1/2σ²)[Σ_i (x_i − μ̄)² + Σ_j (y_j − μ̄)²
          + (mn/N)(μ_2 − μ_1)²]} cosh[(mn/Nσ²)(μ_2 − μ_1)(ȳ − x̄)]

where cosh(t) = (e^t + e^{−t})/2, which is an increasing function of |t|. The
most powerful randomization test against this density is that rejecting for
large |Y - X| (Problem 32b), as was to be proved.*  □

PROBLEMS

1. Show that the randomization tests based on Y − X, Ȳ, Σ_i X_i, Σ_j Y_j, S*, S**, the
   ordinary two-sample t statistic, and r, are all equivalent; here S* = Σ_{r_i > m} X_i −
   Σ_{r'_j ≤ m} Y_j and S** = Σ_{r'_j > n} Y_j − Σ_{r_i ≤ n} X_i, where r_i is the rank of X_i and r'_j is the rank
   of Y_j in the combined ordered sample, and r is the ordinary product-moment
   correlation coefficient between the N observations and the N indicator variables
   defined by
        I_k = 1 if the observation with rank k is an X
            = 0 if the observation with rank k is a Y.
*2. Show that the confidence bounds for shift corresponding to the one-tailed, two-
sample randomization tests based on Y - X at level (l( = k/(~) are the kth smallest
and largest of the differences of equal-sized subs ample means.
*3. Consider the C(N, m) − 1 differences of equal-sized subsample means in the two-sample
    problem. Show that under the shift assumption
    (a) The C(N, m) intervals into which these differences divide the real line are equally
        likely to contain the true shift μ.
    (b) If M of the differences are selected at random without replacement, then the kth
        smallest is a lower confidence bound for μ at level α = k/(M + 1).
4. Find the mean and variance of the randomization distributIOn of the statistics Y,
S*, and S** defined in Problem I.
5. (a) Show that the randomization distribution of Y − X is symmetric about 0 if
       (i) m = n, or (ii) the combined sample is symmetric.
   *(b) Can you construct an example in which the randomization distribution of
        Y − X is symmetric about 0 but neither (i) nor (ii) of (a) holds? (The authors
        have not done so.)
6. Consider the samples X: 0.5, 0.9, 2.0 and Y: −0.2, 6.5, 11.5, 14.3. Show that
   (a) The randomization distribution of Y − X is that given in Table 2.1.
   (b) The randomization test based on |Y − X| has upper-tailed P-value 6/35 and
       hence rejects the null hypothesis at level α = 6/35.
   (c) The equal-tailed randomization test based on Y − X at level α = 6/35 and the
       lower-tailed test at level 3/35 both "accept" the null hypothesis. Find the
       P-values.
7. Suppose the following are two independent random samples, the second drawn
   from the same distribution as the first except for a shift of the amount μ.
        X: 0.2, 0.6, 1.2   and   Y: 1.0, 1.8, 2.3, 2.4, 4.1
   (a) Use the randomization test based on the difference of sample means (or an
       equivalent randomization test statistic) to test the null hypothesis μ = 2 against
       the alternative μ < 2, at a level near 0.01.
   (b) Give the exact P-value of the test in (a).
   (c) Give the confidence bound for μ which corresponds to the test used in (a).
   (d) Find the approximate P-value based on (i) the standard normal distribution
       and (ii) Student's t distribution with N − 2 = 6 degrees of freedom.
8. Show that the randomization test based on Y - X and the two-sample rank sum
test are equivalent for one-tailed (l( :::; 2/(~) but not for (l( = 3/(~) or 4/(~) (assume
//I ~ 4, II ~ 4).

*9. Express the confidence bounds for shift correspondmg to the randomization test
based on Y - X in terms of the order statistics of the two samples sepal ately for
C( = kl(~), k = 1,2,3,4.

10. Verify the expressIOn for t given in (2.7).

II. Verify that (2.9) and (2.11) are equivalent expressions for the degrees-of-freedom
correction factor d.

12. Show that C 2 as given by (2.12) is a consistent estimator of the population kurtosis
minus 3 if the X's and Y's are independent and identically distributed with a suitable
number of filllte moments.

13. Show that the step relating F with fractIOnal degrees of freedom to a scaled t dis-
tributIOn 111 Sect. 2.5 is the same as in Sect. 2.5 of Chap. 4 with n replaced by N - I
(see Problem 18 of Chap. 4).

14. (a) Derive formulas (2.14) and (2.15) for the first two moments of the I andomiza-
tion distnbution of(Y - X)2/(N - 1)(J'2.
(b) Show that the beta distribution with the same first two moments has para-
meters given by (2.16).
(c) Show that approximating the randomizatIOn distribution of(Y - X)2/(N _1)(J'2
by thiS beta dlstnbutlOn IS equivalent to approximating the randomization
distribution of (2 by an F distribution with d and (N - 2)d degrees of freedom,
where d is given by (2.9).

15. Show that the randomization distributIOn of t has mean 0 but need not have third
moment O.

*16 Show that a group of transformations can be used to restrict the randomization
set as descnbed in (b) of Sect. 2.5.

*17. Let T(X_1, ..., X_m, Y_1, ..., Y_n) be nondecreasing in each X_i and nonincreasing in
     each Y_j and consider randomization tests based on T. Show that
     (a) The corresponding confidence regions are intervals.
     (b) The level of the lower-tailed test remains valid when the observations are
         mutually independent and every X_i is "stochastically larger" than every Y_j.
     (c) The lower-tailed test is unbiased against alternatives under which the observa-
         tions are mutually independent and every X_i is "stochastically smaller" than
         every Y_j.
     Note: Remember that the critical value of a randomization test depends on the
     observations and hence is not constant.

18. (a) Show that the randomization distribution of L X, IS the same as the null
dlstnbutlOn of the sum of the X scores for scores (/k (Sect. 5, Chap. 5).
(b) Relate formulas (2.3) and (2.4) for the mean and variance of the randomization
distribution of X to the corresponding results given in Problem 77a of Chap. 5
for sum of scores tests.
(c) Why, despite (a), is the randomization test not a sum of scores test?
19. Invent an "adaptive" two-sample randomization test and discuss the rationale for
It.

20. In each of the following situations, identify the real-world counterparts of X_1, ...,
    X_m, Y_1, ..., Y_n and a_1, ..., a_N. Would it be possible to distinguish order within
    samples? Meaningful? Desirable? If a randomization test is to be used, should it be
    C(N, m)-type? N!-type? What null hypothesis would be appropriate? Why? The situa-
    tions are sketchily described; give further details as you need or desire.
    (a) A library ceiling has 50 light bulbs. Two types of bulbs are used in an experi-
        ment and the lifetime of each bulb is recorded.
    (b) In a library and a less well ventilated hallway with the same type of bulbs,
        bulb lifetimes are recorded to see if they are affected by ventilation.
    (c) A set of patients who are deemed appropriate and have given consent receives
        the usual medication for some type of high fever. A randomly selected subset
        of this set is also given a standard dose of an additional new drug. The temp-
        erature change in a four-hour period is recorded for all patients.
    (d) Same as (c) except that the dose of the new drug is varied from 20% below to
        20% above the standard level.
21. In one of the situations of Problem 20 or some other situation, describe a randomiza-
tion test which is not permutation invariant and why it might be desirable to use it.
22. (a) Show that a randomization test is N! type if it is (~) type.
(b) In what sense is an (~)-type randomization test more conditional than an N!
type?
23. Given k and n, in how many ways can one select integers j_1, j_2, ..., j_n which are all
    different and satisfy 1 ≤ j_1 < j_2 < ... < j_k ≤ n and 1 ≤ j_{k+1} < j_{k+2} < ...
    < j_n ≤ n?

24. (a) Show that all level α tests are C(N, m)-type randomization tests if the null hypothesis
        is that X_1, ..., X_m; Y_1, ..., Y_n are a random separation of a_1, ..., a_N into an X
        sample and a Y sample, where a_1, ..., a_N are arbitrary constants.
    (b) Suppose it is known that all tests having level α for some null hypothesis H*_0
        are C(N, m)-type randomization tests. Show that the same is true for every H**_0
        which contains H*_0.
    (c) Show that all level α tests are C(N, m)-type randomization tests for H_0: The observa-
        tions Z_1, ..., Z_N are independent with arbitrary distributions, and X_1, ..., X_m;
        Y_1, ..., Y_n are a random separation of Z_1, ..., Z_N into an X sample and a Y
        sample.
25. (a) Show that If a test has level CI. for the H 0 or H'o given after Theorem 3.1 and is
unbIased against one-sided (or two-sided) shift alternatives, then it has level
exactly CI. under every null distribution.
(b) Show that the result in (a) in turn implies that the test has conditIOnal level CI.
given the combined sample observations.

26. (a) Show that a test is the most powerful N !-type randomization test at level CI.
against a simple alternative if and only if it has level exactly CI. and is of the
form (4.1) where k is a function of the order statistics of the combined sample.
*(b) What change occurs in the statement in (a) for (~)-type randomization tests?
27. Show that a one-tailed randomization test based on Y - X (or any equivalent
statistic) is ul11formly most powerful against one-sided normal alternatives with
common variance.

28. (a) Show that if a test is invariant under transformations carrying X into c − X
        and Y into c − Y for all c, then its power against the alternative that X is
        N(μ_1, σ²) and Y is N(μ_2, σ²) depends only on μ_1 − μ_2 and σ.
    (b) Show that if the test in (a) is also invariant under changes of scale (carrying X
        into bX and Y into bY) then its power depends only on (μ_1 − μ_2)/σ.
    *(c) Why were changes of scale not considered in Sect. 4.3?
*29. Show that the randomization test based on |Y − X| has the "most stringent"
     property of Sect. 4.3
     (a) If the alternative is restricted to the region |μ_1 − μ_2| > bσ where b is some
         given constant.
     (b) If α*(μ_1, μ_2, σ) is redefined as the maximum power achievable by any test.
*30. Show that uniformly most powerful invariant tests are "generally" most stringent
     under suitable conditions. What condition is most important?

31. In the situation of Sect. 4.3, draw hypothetical graphs of α*(μ_1, μ_2, σ) and of α(μ_1,
    μ_2, σ) for the level α randomization test based on |Y − X| as functions of (μ_2 −
    μ_1)/σ. Indicate how to find Δ from these graphs. What do the properties of α(μ_1,
    μ_2, σ) as "uniformly most powerful invariant" and "most stringent" imply about the
    graphs of the power of other randomization tests? (Assume the other randomization
    tests are invariant under increasing linear transformations so that their power
    depends only on (μ_1 − μ_2)/σ, but do not assume that they are invariant under
    changes of sign.) What can you say about the power of a one-tailed randomization
    test based on Y − X at level α?
32. (a) Show that the power of a test against the density h given by (4.2) is the average
        of its power against the two normal alternatives given by
        (μ_1, μ_2, σ) and (c − μ_1, c − μ_2, σ).
    (b) Show that the most powerful randomization test against h is that rejecting for
        large |Y − X|.
    *(c) From this, show that the randomization test rejecting for large |Y − X| is
         uniformly most powerful invariant and most stringent against normal alterna-
         tives as stated in Sect. 4.3.
CHAPTER 7
Kolmogorov-Smirnov
Two-Sample Tests

1 Introduction
We have not previously discussed the use of criteria suggested by direct
comparison of empirical (sample) cumulative distribution functions with
one another or with hypothetical c.d.f.'s ("goodness of fit"). This important
approach leads to a wide variety of procedures which stand apart from the
procedures of earlier chapters in several respects. They are expressed in a
different form. The relevant statistics are not approximately or asymptotically
normally distributed. The theory of their asymptotic behavior is fascinating
and raises different kinds of problems requiring different kinds of tools. The
mathematical interest of these and other problems has played a larger role
than statistical questions in motivating the extensive literature about them,
although there is also some excellent work on statistically important
questions.
The one-sample test procedures require that a completely specified dis-
tribution be hypothesized. In this sense they are not "nonparametric."
Although the test statistics are "distribution-free" as that term is usually
defined, they relate to null hypotheses that are entirely different from those in
Chaps. 2-4, which require only symmetry. The two-sample procedures, how-
ever, relate to the" nonparametric" null hypothesis of identical distributions
used in Chaps. 5 and 6.
The Kolmogorov-Smirnov criterion of maximum difference, defined
below, has received the most attention. Another natural criterion is the
Cramer-von Mises integrated squared difference. These and other variations
of them involving "weights" are the only specific criteria developed in this
tradition which have been broadly investigated for statistical purposes.
Pearson's chi-square goodness-of-fit criterion, which compares cell
frequencies rather than cumulative frequencies, is even more popular. It


could be viewed as a comparison of changes in c.d.f.'s, rather than a com-


parison of c.d.f.'s themselves.
We will discuss only the Kolmogorov-Smirnov criterion here because it
is the only well-developed goodness-of-fit criterion that is at all competitive
with the other procedures considered in this book against shift and similar
alternatives. Also, simple confidence procedures can be based on it, as well as
tests. The scope of this book does not extend to testing goodness of fit in the
usual sense, especially not to preliminary tests to be followed by further
analysis. A brief comment on preliminary testing is in order, however,
because preliminary testing followed by a parametric analysis if the null
hypothesis is" accepted" may be considered an alternative to non parametric
methods. The deviations from parametric assumptions that are most likely
to occur are typically large enough to have a significant and perhaps even
disastrous effect on a parametric analysis, and yet are not large in a goodness-
of-fit sense. Typical goodness-of-fit tests, or others that could be devised,
have low power against these alternatives for usual sample sizes, and hence
provide little and inadequate protection. The accept/reject approach is not
really appropriate to the purpose. Type II errors are serious while a Type I
error is not (devising an unnecessary alternative analysis), and power is
low. All this suggests that an appropriate significance level will be so large
that one might as well assume rejection without even testing. The ques-
tions are whether the parametric analysis may be seriously invalidated by
deviations from assumptions, and if so what to do about it. The answer to
the first question usually depends far more on the particular form of analysis
and the background of the data than on anything detectable by a goodness-of-
fit statistic. The answer to the second question lies in sensitivity analyses and
robust procedures, including non parametric methods.
For reasons evident from these preliminary remarks, we give in this
chapter a separate introduction to Kolmogorov-Smirnov procedures, em-
phasizing the two-sample case, and include more references than usual
because we cannot go deeply into the theory.

2 Empirical Distribution Function


Given any sample from an unspecified population, a natural estimate of the
unknown cumulative distribution function of the population is the empirical
(or sample) distribution function of the sample, defined, at any real number t,
as the proportion of sample observations which do not exceed t. For a
sample of size m, the empirical distribution function will be denoted by F m
and may be defined in terms of the order statistics X(I) S X(2) S ... S X(m)
by
if t < X(l)
if X{J) s t < X(j+ 1)' 1 s j < m (2.1)
if t ~ X(m)'
320 7 Kolmogorov-Smirnov Two-Sample Tests

The following properties of F m(t) are easily proved (Problems 1 and 2) for
observations which are independently and identically distributed with
c.dJ. F.
(a) The random variable mFm(t) follows the binomial distribution with
parameters m and F(t).
(b) The mean and variance of F m(t) are
E[Fm(t)] = F(t), var[Fm(t)] = F(t)[l - F(t)]/m.
(c) Fm(t) is a consistent estimator of F(t) for fixed t.
(d) Fm(t) converges uniformly to F(t) in probability, that is,
P[IFm(t) - F(t)1 < e for all t] -t 1 as m - t 00, for all e > O.
(e) Fm(t) converges uniformly to F(t) with probability one (Glivenko-
Cantelli Theorem).
(f) F m(t) is asymptotically normal with mean and variance given in (b).
(g) The empirical distribution is the mean of the indicator random variables
defined by
15 () = {I if X j ~ t
•t 0 otherwise,
that is,
m
Fm(t) = Ib.(t)/m.
j= I

In particular, F I(t) = 15 I (t). The covariance between values of bj(t) for the
same i but different t is
F(tI)[1 - F(t 2 )] if tl ~ t2
{
cov[b.(t l ), bj (t 2)] = F(t 2)[1 _ F(tl)] ift2 ~ tl

= a(tl' t 2 ), say.
(h) cOV[Fm(t,), F m(t2)] = a(tl' t 2)/m.
(i) For any fixed t 1o ••• , tko the random variables Fm(td, ... , F",(tk) are
asymptotically multivariate normal with mean vector [F(t l ), ..• , F(t k)]
and covariance matrix with (i,j) element a(t" tj)/m. This can be proved
by applying the multivariate Central Limit Theorem for identically
distributed random variables to the vectors b. = [bj(t I)' ... , b;(tk )] or by
way of the multinomial distribution of the increments F m(t I), F m(t2)
- Fm(t l ) , · · · , Fm(tk) - Fm(tk-I)' 1 - Fm(tk)'

3 Two-Sample Kolmogorov-Smirnov Statistics


In the previous two chapters we discussed a variety of tests for the situation
where two independent samples, X I' ..• , Xm and Y1, ... , Y,., are drawn from
populations with unspecified c.dJ.'s F and G respectively, and the null
3 Two-Sample Koltnogorov-Smlrnov StatIstics 321

hypothesis is F = G. Now the empirical distribution functions of the X


and Y samples, denoted by Fm(t) and Gn(t) respectively, estimate their respec-
tive population c.d.f.'s. If the null hypothesis is true, there should be close
agreement between F m(t) and Gn(t) for all values of t. Some overall measure
of the agreement or disagreement between the functions F m and Gn is then
a natural test statistic. The Kolmogorov-Smirnov two-sample test (some-
times called simply the Smirnov test) is based on the maximum difference;
of course, F m(t) and Gn(t) are in close agreement at all values of t if the
maximum difference is small. Specifically, the test statistics are given by
(3.1)

(3.2)

D;;'n = max [F m(t) - Gn(t)], (3.3)

where (3.1) is called the two-sided statistic since the absolute value measures
differences in both directions, and (3.2) and (3.3) are called the one-sided
statistics. Appropriate critical regions are to reject F = G if Dmn is "too large"
for a two-sided test, if D;!n is "too large" for a one-sided alternative G 2 F
and if D;;'n is "too large" for a one-sided alternative G ~ F. (Assume each
of these alternatives excludes F(t) = G(t) for all t.)
Tests based on these statistics would appear to be sensitive to all types
of departures from the null hypothesis F = G, and hence not especially
sensitive to a particular type of difference between F and G. However, even
for location alternatives, against which most of the two-sample tests pre-
sented in this book are designed to perform well, the Kolmogorov-Smirnov
statistics are sometimes quite powerful. They are primarily useful, however,
when any type of difference is of interest.
Alternative expressions of the Kolmogorov-Smirnov statistics are more
easily evaluated, and they also show that the maxima are achieved. In (3.2)
for instance, note that reducing t to the next smaller lj does not change
Gn(t) and hence can only increase the maximand. Therefore, we can write
D;!n = max [Gn(lj) - FmClj)]
)

= max [U/n) - F mCl(j»]


)

= max [U/n) - (M/m)] (3.4)

where l(j) is thejth smallest Y and M j is the number of X's less than or equal
to l(j). Similarly,
D;;'n = max [Fm(X,) - Gn(Xj)]
j

= max [(ilm) - (N./n)] (3.5)


322 7 Kolmogorov-SmJrflOV Two-Sample Tests

where N j is the number of y's less than or equal to X(I)o the ith smallest X.
The two-sided statistic is simply
(3.6)
These representations (and others, Problem 3), also make it evident that
the Kolmogorov-Smirnov statistics depend only on the ranks of the X's
and Y's in the combined sample. Thus the tests based on them are two-sample
rank tests. In particular, their distributions under the null hypothesis that
the X's and Y's are independent and identically distributed with an un-
specified common distribution do not depend on what that common dis-
tribution is as long as it has a continuous c.dJ. (This result is also evident
otherwise; Problem 3.) The same null distributions hold also in the situation
of say a treatment-control experiment where m units are selected at random
from N to receive treatment, if the null hypothesis is no treatment effect and
the distribution of the characteristic being measured is continuous so that ties
have probability O. (Ties are discussed in Sect. 5.)

4 Null Distribution Theory


In order to apply the Kolmogorov-Smirnov one-sided or two-sided tests,
we need to find the null distributions of the statistics so that critical values
or P-values can be determined. An obvious and straightforward method of
enumerating the possibilities in order to obtain the null distribution of the
Kolmogorov-Smirnov statistics (and many others) will be described in
Sect. 4.1. Using this method, Kim and Jennrich prepared extensive tables of
the exact null probability distribution of Dmn; these tables appear in Harter
and Owen [1970]. A few values are given in Table G of this book. Section 4.2
contains brief comments on the relation between the one-tailed and two-
tailed procedures. In Sect. 4.3 we derive simple expressions in closed form
for the null distributions of both the one-sided and two-sided statistics for
the case m = n. These are useful in themselves and asymptotic distributions
are easily derived from them, as in Sect. 4.4, which also gives a heuristic
argument that the same asymptotic distributions apply for m =1= n. Much of
the mathematical interest lies in making the asymptotic distribution theory
rigorous. It is also needed as an approximation for finite m and n beyond the
range of tables. Hodges [1957], however, reports a brief numerical investiga-
tion of accuracy, and shows that this approximation can be quite poor and
the accuracy fluctuates wildly as m, n change. In particular, it is much better
for m = n than for m = n + 1, same n.
The null distributions are usually attributed to Gnedenko and the Russian
school (see, for example, Gnedenko and Korolyuk [1951] or Gnedenko
[1954]), but many others have worked on finding convenient methods of
computation and expression. See, for example, Massey [1951a], Orion
[1952], Tsao [1954]. Korolyuk [1954] and [1955], Blackman [1956], Car-
4 Null Dlstnbution Theory 323

valho [1959], and Depaix [1962]. Hodges [1957] includes a useful review of
algorithmic methods. (See also Steck [1969] for results that also make use of
the first place where the maximum occurs.) Hajek and Sidak [1967] give a
good summary of the results known about the Kolmogorov-Smirnov
statistics. Darling [1957] also gives a valuable exposition on these and
related statistics and a rather complete guide to the literature through 1956.
Barton and Mallows [1965] give an Appendix with references on subsequent
developments.
Throughout this section we assume (as does most of the literature most
of the time) that the common distribution of the independent random
variables X and Y is continuous, so that, with probability one, no two ob-
servations are equal. Then we can ignore the possibility of ties either within
or between samples. Ties will be discussed in Sect. 5.

4.1 An Algorithm for the Exact Null Distribution

We now describe a simple counting or addition procedure for finding the


probability P[D mn < c] = P[max,IGn(t) - Fm(t) I < c] under the nul1 hy-
pothesis.
Represent the combined arrangement of X's and Y's by a path in the
plane (as in Problem 4, Chap. 5) that starts at (0,0) and moves one unit
up or to the right according as the smal1est observation is a Y or an X, from
there moves one more unit up or to the right according as the second smal1est
is a Y or an X, etc. Thus, the kth step is one unit up if the kth smallest observa-
tion is a Y, one unit to the right if it is an X, and the path terminates at (m, n).
(For example, Fig. 4.1 represents the combined sample arrangement X Y Y
X X Y Y with m = 3, n = 4.) In this manner, all possible arrangements of
observations from the two samples are placed in one-to-one correspondence
with all paths of this sort starting at (0,0) and ending at (m, n). Under the
null hypothesis, the paths are equally likely.
For any t, let u and v be the number of X's and Y's, respectively, that do
not exceed t. Then F m(t) = u/m and Gn(t) = v/n. Furthermore, the point
(u, v) lies on the path representing the sample, and as t varies, the point

(m, n)

(0, 0) L..--L---'---'---'---'----'---
Figure 4.1
324 7 Kolmogorov-SmIrllOv Two-Sample Tests

(u, V) reaches every lattice point (point with integer coordinates) on the
path. Thus the event Dmn < c occurs if and only if each such lattice point
(u, v) on the path satisfies
I(u/m) - (v/n) 1 < c. (4.1)
Consider this event geometrically. The expression v/n = u/m is the equa-
tion of the diagonal line which connects the origin (0, 0) and the terminal
point of the path (m, n); the vertical distance from any point (u, v) on the
path to this line is 1v - (nu/m) I. Accordingly, Dmn < c occurs if and only if
the path stays always within a vertical distance of cn from the diagonal con-
necting (0, 0) to (m, n) (or equivalently, a horizontal distance of cm).
Let A(u', v') be the number of paths from (0,0) to (u', v') which stay within
this distance, i.e., which satisfy (4.1) for u ::;; u', v ::;; v'. (This number depends
also on m, n, and c, but since they are fixed throughout we do not display
this dependence notationally.) Since every path from (0,0) to (m, n) has
equal probability under the null hypothesis, the probability we seek is

P(Dmn < c) = A(m, n)/(m; n) = A(m, n)/(~). (4.2)

It is obvious that A(u, v) satisfiies the recursive relation


A(U - 1, v) + A(u, v-I) if (u, v) satisfies (4.1)
A (u v) = { (4.3)
, 0 otherwise
with boundary conditions
A(O,O) = 1, = 0 if u < 0 or v < o.
A(u, v)
By carrying out this recursion as far as.u = m, v = n, the quantity A(m, n)
needed in (4.2) is obtained. For small m and n, this can be done by hand by
drawing the two boundary lines (u/m) - (v/n) = ±c in the rectangle with
corners (0,0), (m, 0), (0, n) and (m, n) and entering the values of A(u, v) at
the points (u, v) successively which are inside (not touching) these boundary
lines (Problem 4). Unfortunately, the whole recursion must be carried out
anew for each m, n, and c. For larger but not too large sample sizes, it is
easily done by computer. Some properties of A(u, v) are given in Problem 6.
An analogous procedure can be used to find the probability of staying
within any given boundary (Problem 7). In particular, it can be used for
P(D;:;n < c). (4.4)
We simply define A(u', v') as the number of paths from (0, 0) to (u', v') along
which (u/m) - (v/n) < c, that is, which stay below the upper boundary line
at a distance cn above the diagonal. An alternative procedure may be pre-
ferable however (Korolyuk [1955]; see also Hodges [1957] and Problem
23). It is clear by symmetry (Problem 9) that
P(D,~n < c) = P(D:'II < c) = P(D;;'n < c) = P(Dn~1I < c) (4.5)
for all m, n, and c.
4 Null DistributIOn Theory 325

4.2 Relation Between One-Tailed and Two-Tailed Procedures

The critical regions of the Kolmogorov-Smirnov tests are the complements


of the events just considered; the level or P-value is the probability that the
sample path reaches (or crosses) the relevant boundary. Note that the upper
and lower one-tailed procedures are not based on the same statistic, and
their critical regions are not in general disjoint. The events D;n ~ c and
D;;'n ~ c both occur for the same sample if the path reaches both boundaries.
Therefore, although the two-tailed critical region is the union of two one-
tailed critical regions with equal critical values, the two-tailed level may be
less than twice the one-tailed level for that critical value, and the two-tailed
P-value may be less than twice the one-tailed P-value (but not more). For
c > 0.5, rejection by both one-tailed tests at once is impossible. Thus
P(D mn ~ c) = P(D;;'n ~ c) + P(D;n ~ c) = 2P(D;n ~ c) for c > 0.5. (4.6)
If m and n are both odd, this also holds for c > 0.5 - 0.5/max(m, n) (Problem
10d). Although this is not true for all smaller values of c (Problem tOe), the
discrepancy is very small in the tails, even in relative terms. (For m and n
large or equal, it is less than twice the fourth power of the one-tailed prob-
ability and hence the relative error is less than 1% for one-tailed prob-
abilities of 0.2 or less (Problem 11). This may be true for all m and n. For
numerical results and references, see Miller [1956].) In short, for practical
purposes, if the P-value is small enough to be of interest, the two-tailed
P-value may be taken to be twice the one-tailed P-value. Thus one-tailed
and two-tailed critical values are related in the usual way in most cases,
although complete statements about exact critical values are complicated
because of the discreteness of the distribution.
In calculating power, the possibility of a path reaching both boundaries
may have a more serious effect.

4.3 Exact Formulas for Equal Sample Sizes

In the special case where m = n, expressions can be given in closed form for
the null c.dJ.'s of both the one-sided and two-sided Kolmogorov-Smirnov
statistics. Specifically, we will show that, for k = 1, 2, ... , n,

P(D:" ~ k/n) = (n ~ k) / enn) = (n!)2/(n + k)!(n - k)! (4.7)

P(Dnn ~ k/n) = 2[(n 2n- k) _ (2n )


n - 2k
+ ( 2n ) _ .. . ]/(2n)
n - 3k n

[n/kJ .
= 2L:(-I)'+1 ( 2n . ) / (2n) (4.8)
I~ 1 n - Ik n
326 7 Kolmogorov-SmlfllOv Two-Sample Tests

(m + n, n - m)

Figure 4.2

where [nlk] denotes the largest integer not exceeding nlk. This gives the
entire distribution in each case, since the statistics can take on values only
of the form kin when m = n. These formulas are easily evaluated, but they
apply only to the case m = n.
We now derive these results (following the methods of Drion [1952] and
Carvalho [1959], although they are much older). We again represent the
combined sample arrangement by a path starting at the origin, but the
argument will be easier to follow if we make the steps diagonal. Specifically,
we move one unit to the right and one unit up for each Y, one unit to the right
and one unit down for each X. This gives the same path as before except for
a rotation clockwise by 45° (and a change of scale). Figure 4.2 depicts the
path constructed by this rule for the same sample as Fig. 4.1, X Y Y X X Y Y.
The path now ends at (m + n, n - m), which is (2n, 0) when m = n.
Furthermore, by analysis like that at (4.1), the event Dn~ 2 kin occurs
if and only if the path reaches a height of at least k units above the horizon-
tal axis before it terminates at (2n, 0). We shall prove shortly that the number
of such paths is (n~nk)' Since all paths are equally likely under the null
hypothesis, and since the total number is (~~), it follows that the probability
P(D:" 2 kin) is (n=-",,)/(~~) as given in (4.7). The number of paths reaching
height k is given by setting I = 0 in the following lemma, which will be needed
later for both negative and positive I and k.

Lemma 4.1. Let N(k, l) be the number of paths going from (0, 0) to (2n, 21)
and reaching a height of k units. Suppose k is not between 0 and 2/. (If it is, all
paths terminating at height 21 obviously reach height k.) Then

N(k, I) = ( 2n ) = ( 2n ). (4.9)
n-k+1 n+k-l

PROOF. Paths of this sort can be put into one-to-one correspondence with
paths terminating at height 2k - 21 by reflecting that portion of each path
to the right of the point where it last reaches height k, as in Fig. 4.3. The number
4 Null DistributIOn Theory 327

(2n,2k - 21)
(0, k) ~-------+-~~~~~----I
(2n,2/)

(0, 0) ~----+---------------1

(0, j) f - - - - - - V - - - - - - - - - - - - - l
(0, -k)~----------------I

Figure 4.3

of these reflected paths is simply the total number of paths corresponding to


samples having n - k + I X's and n + k - I Y's, namely (n_2k+/). 0

For the two-sided statistic, consider the event Dnn ~ kin. It occurs if and
only if the path reaches at least k units above or below the horizontal axis. By
symmetry, the number of paths reaching height - k is the same as the number
reaching height k, which was found above. The difficulty is that some paths
may reach both k and - k. We will count paths according to the boundary
they reach first. It is convenient to extend the notation of Lemma 4.1, letting

NU, k, I) = the number of paths going from (0,0) to (2n, 21) reaching
heights i and k, j first;
N(notj, k, l) = the number of paths going from (0,0) to (2n,21) reaching
height k without having reached heightj.
(Note that, in either case, heights j and k may subsequently be reached any
number of times.) In this notation the number of paths satisfying Dnn ~ kin
is the number reaching the upper boundary first plus the number reaching
the lower boundary first, which is

N(not -k, k, 0) + N(not k, -k, 0) = 2N(not -k, k, 0) (4.10)


by symmetry.
It follows immediately from the definitions that
N(notj, k, l) = N(k, I) - NU, k, I). (4.11 )
Furthermore, the reflection used in the proof of Lemma 4.1 now shows (see
Fig. 4.3; this is the crucial step) that
N(j, k, l) = N(j, k, k - I)
= N(not k, j, k - l) if j ~ k ~ 2k - 21 or j ~ k ~ 2k - 2/.
(4.12)
328 7 Kolmogorov-Smlfllov Two-Sample Tests

By applying (4.11) and (4.12) alternately, we can now evaluate (4.10).


This amounts to repeated application of the reflection principle. Each
application turns out to move the terminal height further from 0, until
eventually no path can reach this height (the last term of (4.11) vanishes) and
the process stops. We thus obtain, using (4.9) also,
N(not -k, k, 0) = N(k, 0) - N( -k, k, 0)

= (n ~ k) - N(not k, -k, k),

N(not k, -k, k) = N( -k, k) - N(k, -k, k)

= (n :n2k ) - N(not -k, k, -2k),

N(not -k, k, -2k) = N(k, -2k) - N( -k, k, -2k)

= (n :n3k ) - N(not k, -k, 3k),

etc. Continuing and substituting back each time, we obtain

N(not -k, k, 0) = (n ~ k) - (n :n2k) + (n :n3k) - .... (4.13)


If we define (n :jk) = 0 for n <jk, this series terminates exactly when the
number of paths corresponding to the next term is O. Multiplying (4.13)
by 2 gives the number of paths satisfying Dnn ~ kin, and dividing this result
by the total number of paths, (~n) as before, then gives (4.8).

4.4 Asymptotic Null Distributions

For m = n, the exact formulas (4.7) and (4.8) for the null tail probabilities
lead directly to the asymptotic null distributions of the one-sided and two-
sided Kolmogorov-Smirnov statistics. We first investigate the behavior of
(4.7) for large k and n. The right-hand side can be written as

( 2n )/(2n) n(n - 1) ... (n - k + 1)


n-k n =(n+k)(n+k-I) ... (n+l)

(1 - n : k)(l - n + ~ _ 1) ... (1 - n : I)
r
=

= [1 - ~ + o(~) kin ~ if 0 (4.14)

where o(x) denotes terms of order smaller than x. If we now substitute


1+x + o(x) = eX+o(x)
4 Null DistnbutlOn Theory 329

with X = - kin in (4.14), we obtain

(n ~ k) / (~n) = e-k2/n+O(k2/n). (4.15)

To obtain limits other than 0 or 1, k must be of order In. Normalizing by


j;j2, for reasons evident later, we have by (4.7)
P(J;iiD:" ~ A) = P(D:" ~ Afo/n)

where k now and hereafter is the largest integer not exceeding Aj2n. Thus,
for A fixed, e/n -4 2A2 and we find by (4.15) that
(4.16)
n-->oo

This is a very simple, easily calculated expression for the asymptotic prob-
abilities. We note also that n(D:")2 is asymptotically exponentially distribu-
ted and 2n(D:")2 is asymptotically chi-square distributed with 2 degrees of
freedom (Problem 28), so that exponential or chi-square tables could be
used.
The limiting distribution of the two-sided statistic is found the same way
but using (4.8) as follows.
P(J;iiD nn ~ A) = P(D nn ~ Aj2n/n)

=2[~\-1Y+I(
1
i=
2n. )/(2n)
n - Ik n
(4.17)

where k is the largest integer not exceeding Aj2n, as before. Substituting ik


for k in (4.15) we find

lim( ~n. )/(2n) = lime-i2k2/n = e-2i2).2. (4.18)


n
n-->oo Ik n n-+oo

If we now substitute (4.18) in (4.17) we obtain the result

lim P(,;;;iiD nn ~ A) = 2 L (_1)'+ le- 212 ).2.


00

(4.19)
n-oo i== 1

Taking the limit as n -4 00 term by term in (4.17) can be justified by the fact
that the sum is, for each n, and in the limit, an alternating series whose
terms decrease monotonically in absolute value to 0 as i -4 00, and therefore
dominate the sum of all later terms.
Now we will show heuristically that if D~n and Dmn are suitably standard-
ized, their limiting distributions do not depend on how m and n approach 00,
and hence are the same for all m and n as for m = n, so that the expressions
on the right-hand sides of (4.16) and (4.19) apply also to unequal sample
sizes.
330 7 Kohnogorov-Smlrnov Two-Sample Tests

For two independent samples, with arbitrary but fixed t 1 , ••• , t k , the
random vectors
(4.20a)
and
(4.20b)
are independent. By the property (i) in Sect. 2, if the observations are iden-
tically distributed with c.dJ. F, these random vectors are both asymptotically
multivariate normal with the same mean vector and with covariances
a(tj, t)/m and a(tj, tj)/n respectively. It follows that the vector
jmn/(m + n)[Gn(tl) - F m(tl)], ... , jmn/(m + n)[Gn(tk) - Fm(t k)]
is asymptotically multivariate normal with mean vector 0 and the co variances
a(t" t). Hence for each t 1 , ••• , tb the quantities jmn/(m + n)[Gn(t i )
- Fm(t j)], i = 1, ... , k, have the same limiting joint distribution however
m and n approach 00. This suggests that the maximum, or the maximum
absolute value, will exhibit this same property, that is, that (4.16) and (4.19)
generalize to

(4.21)
00

P(jmn/(m + n)Dmn ~ A) _ 2L:(_1)i+l e -2i 2;.2 (4.22)


i= 1

as m and n approach 00 in any manner whatever. It is far from a proof,


however, because the joint limiting distributions of the values of a function
on arbitrary finite sets do not suffice to determine the limiting distribution
of its maximum.
The asymptotic distribution of the two-sided, one-sample statistic (see
Sect. 7) was derived by Kolmogorov [1933]. Smirnov [1939] simplified this
proof and proved corresponding results for the one-sided and two-sample
cases as well. Feller [1948] used generating functions in connection with
recursive relations. Doob [1949] made a transformation to the Wiener
process (Problem 55) which shows heurtistically how these and other
asymptotic distributions relate to first passage probabilites for that process
and he derived the results in this way; see also Kac [1949]. Donsker [1952]
justified this approach rigorously. A simple proof by van der Waerden
[1963] and later work by Steck [1969] should also be mentioned. For
further references, see the literature cited in the introduction to this section.

5 Ties
The finite sample and asymptotic null distributions of the Kolmogorov-
Smirnov statistics discussed in Sect. 4 hold exactly only for a common dis-
tribution F which is continuous, and they do not depend on F as long as it is
6 Performance 331

continuous. On the other hand, if F is discontinuous, the distributions do


depend on F and hence the statistics are no longer distribution free. However,
the tests are conservative; that is, the critical values at level ex for F continuous
are conservative if F is not continuous. Equivalently, P(D mll ;;::: c) and
P(D;!" ;;::: c) are smaller for F discontinuous than for F continuous.
One way to see this is to observe that the presence of ties reduces the
range of possibilities over which the maximum is taken and hence can only
make the statistics smaller. Specifically, consider D;;"" as defined by (3.4) for
instance, and let D;!n* be the value of D;!n which would be obtained by break-
ing the ties at random. Define D!" similarly. The null distributions of D;!n*
and D!" are the same as in the continuous case, because all sample arrange-
ments of X's and Y's remain equally likely if ties are broken at random.
Furthermore, D;!n :S D;!,,* because breaking ties can decrease but cannot
increase the number of X's less than or equal to Y(j); similarly, Dmn :S D!".
Therefore the upper tail probabilities of D;!" and Dmn are no larger for dis-
continuous F than for continuous F; actually they are smaller (Problem 29).
The general method of Problem 106 of Chap. 5 does not apply directly
here because the Kolmogorov-Smirnov statistics are not continuous func-
tions of the observations (Problem 30a). It can, however, be extended to
"lower-semicontinuous" functions, which these statistics are (Problem 30b).

6 Performance
Suppose that the X's and Y's are independent random samples from popula-
tions with c.d.f.'s F and G respectively. We have already seen that the sample
c.dJ. F m(t) for the X sample is a consistent estimator of the population c.dJ.
F(t), not merely at each t, but in the stronger sense of (d) in Sect. 2; we call this
property "strong consistency" temporarily. An equivalent statement
(Problem 2b) is that
Dm = sup IFm(t) - F(t) I -+ 0 in probability.

Here Dm is the one-sample Kolmogorov-Smirnov statistic for the hypo-


thesized distribution F, discussed later in Sect. 7. Because there may be no
maximizing t, Dm is defined as a supremum (sup), which means least upper
bound. (Similarly, the infimum (inf) is the greatest lower bound. For any set
of numbers, the supremum and infimum always exist, but either or both may
be infinite.)
Actually, stronger properties than strong consistency have already been
given. The Glivenko-Cantelli Theorem says that Dm -+ 0 with probability
one, and the asymptotic results discussed in Sect. 4.4 imply that Dill is of order
l/fo in probability. All we need here, however, is that Fm is a strongly
consistent estimator of F. Similarly, G" is a strongly consistent estimator of G.
If m and n both approach 00, it follows that G" - Fm is a strongly consistent
332 7 Kolmogorov-SmlrllOv Two-Sample Tests

estimator of G - F and hence that Dmn , D~n' and D;;;n are consistent estima-
tors of the corresponding population quantities, namely, the maxima (or
suprema) of IG(t) - F(t) I, G(t) - F(t), and F(t) - G(t) respectively (Prob-
lem 31).
Consistency properties of the Kolmogorov-Smirnov tests follow in turn.
Under the null hypothesis F = G, all three population quantities are 0, and
the statistics are distribution free if the common population is continuous
and stochastically smaller otherwise. Therefore, at any fixed level, the
critical values of all three statistics approach 0 as m and n approach 00. On
the other hand, the statistics converge in probability to positive values, and
consequently each test is consistent, whenever the corresponding population
quantity is positive. Specifically, the two-sided test which rejects for large
values of Dmn is consistent against all alternatives F, G with F =1= G, that is,
with F(t) =1= G(t) for some t. (Details of the proof are left as Problem 32.)
The one-sided test which rejects for large values of D~n is consistent against
all alternatives F, G with F(t) < G(t) for some t, and similarly for D;;;n and
alternatives with F(t) > G(t) for some t. Note that these one-sided alter-
natives include stochastic dominance and the shift model of Sect. 2 of Chap. 5
as special cases.
We now derive quite simply a lower bound on the power of the one-sided
test and the behavior of this bound as m, n -+ 00 [Massey, 1950b]. The
bound and its asymptotic behavior provide useful insight and will be relevant
in Chap. 8. Let cmn . cx denote the right-tailed critical value of D~n for a test
at level a. Define L\ = sup, [G(t) - F(t)] and suppose, for convenience, that
the maximum is actually achieved so that L\ = G{to) - F{to) for some to.
We know that L\ > 0 if F(t) < G(t) for some t. Since D~n is certainly never
less than Git o) - Fm(to), the power of the test satisfies the inequality

P(D~n ~ cmn,cx) ~ P[Gn{to) - Fm{t o) ~ cmn,cx]. (6,1)

Since mFm(t O) and nGn(t o) are independent and binomially distributed (see
(a) of Sect. 2), the right-hand side of (6.1) can be evaluated without great
difficulty for any specified F and G. Furthermore, for m and n large cmn,cx
in (6.1) can be approximated by means of (4.21) and the binomial distribu-
tion can be approximated by the normal distribution (see (f) of Sect. 2).
The result is (Problem 33)

P(D~n ~ cmn,cx) ~ I - <P[(ccxJN/mn - L\)/a] approximately, (6,2)

where

Ccx = J -log a/2 (6.3)

and

(J = J{F(to)[l - F(to)]/m} + {G(to)[l - G(to)]/n}, (6.4)


6 Performance 333

Using the further inequality (J ~ J N /4mn, we obtain


P(D'!II 2:: cmll ,,,) 2:: 1 - <l>[2(c" - !!Jmn/N)] approximately. (6.5)
If both m and n approach infinity, the right-hand side of (6.5) approaches 1
since Jmn/N --4 00, and the power, which is larger, must also approach 1.
Thus (6.5) implies in particular the consistency of the test based on D;:;n
against the alternative F(t) < G(t) for some t, as stated earlier. It is a stronger
result than consistency, however, since it also gives a lower bound on the
rate at which the power approaches 1. Since the power of optimum parametric
tests is typically of the same order of magnitude, as we will show in Chap. 8,
this lower bound also shows that the Kolmogorov-Smirnov tests are not in-
finitely less efficient than other tests, as one might have feared. Rather, their
power is of the same order of magnitude, whether greater or less. Rigorous
statements along these lines unfortunately require analysis of the order of
magnitude of the neglected terms, which is difficult; for example, the normal
approximation to the binomial distribution has relative error approaching
infinity in the smaller tail when the approximate normal deviate becomes
infinite at the rate .)"ii, as it does here for fixed !!.
Equation (6.1) gives a lower bound for the power function of the one-
sided Kolmogorov-Smirnov test and an analogous bound is easily found
for the two-sided test, as is an upper bound for the one-sided test (Problem
34). These bounds can always be evaluated for simple alternatives which
specify both F and G, but of course how sharp they are is not known in
general.
Some specific studies of power have been reported in the literature. For
example, Dixon [1954], Epstein [1955] and van der Waerden [1953] made
some power comparisons for various tests against alternatives of normal
distributions differing only in location. These results show that the Kolmo-
gorov-Smirnov tests are less powerful than the Wilcoxon, van der Waerden,
Terry or t tests, but more powerful than the median test, for normal alter-
natives. While such results are indeed of interest, they cannot really provide
any general conclusions for other alternatives. The calculation of power
functions under more general, non parametric alternatives would be valuable.
Lehmann [1953] considered this problem, and Steck [1969] found formulas
for the distribution of the one-sided statistic under Lehmann alternatives
(one c.dJ. is a power of the other). Unfortunately, it appears that power is
difficult to compute in general except by simulation.
Under most alternatives even the limiting distributions of these statistics
are unknown so that it is not possible to compute asymptotic power either.
However, bounds on the power can be calculated for large m and n using
(6.5) and similar inequalities mentioned earlier. Capon [1965] used this
approach to find a lower bound for the asymptotic (Pitman) efficiency of
these tests, which can be applied to any specified distributions. Klotz [1967]
found similar results for asymptotic efficiency as defined by Bahadur.
334 7 Kolmogorov-Smlrnov Two-Sample Tests

Pitman efficiency will be discussed extensively in Chap. 8 of this book and


the Kolmogorov-Smirnov procedures are considered in Sect. 11 of that
chapter.

7 One-Sample Kolmogorov-Smirnov Statistics


So far in this chapter we have been concerned only with inferences based on
two independent samples using the Kolmogorov-Smirnov two-sample
statistics. Analogous one-sample statistics can be used for certain kinds of
inferences concerning the distribution of a single random sample.
For example, given n independent, identically distributed random
variables Xl, ... , X n from a c.dJ. F, suppose we wish to test the null hypo-
thesis H 0: F = F o. Let Fn(t) denote the empirical distribution function of this
sample, and consider the following three statistics which are called the
Kolmogorov-Smirnov one-sample statistics:
(7.1)

(7.2)

(7.3)

Against the two-sided alternative F{t) "# Fo(t) for some t, a test rejecting the
null hypothesis for large values of Dn is natural, and is consistent (Problem
35). Similarly, a test based on D: is consistent against all alternatives F
with F{t) > F oCt) for some t, and one based on D;; is consistent against
F(t) < F oCt) for some t. Curiously enough, however, each of these tests is
biased (slightly) against the corresponding alternative {Massey [1952a]
(Problem 36». When F 0 is not continuous, tests using the null distribution
or critical values for the continuous case are conservative (Problem 37).
Of course, the c.dJ. F 0 must be fully specified in order to calculate the
value of any of these one-sample test statistics. In this sense, the tests are
"parametric," and accordingly are discussed only briefly here. The statistics
are however "nonparametric," or at least distribution-free, in the sense
that when F = F0, their distributions do not depend on F 0 as long as F 0 is
continuous (Problem 38). Because of this, the problem of deriving their exact
null distribution in the continuous case need be solved for only one con-
tinuous F 0, for instance, the uniform distribution on (0, 1). These null dis-
tributions are continuous. Further, the one-sided statistics, D: and D;;,
have identical distributions by symmetry. Massey [1950a] (see also Kolmo-
gorov [1933]) derived a recursive relation for the null probability P(D n ::; kin)
for integer values of k; his method applies also to Dn+. Birnbaum and Tingey
[1951] found an expression in closed form for the entire cumulative null
7 One-Sample Kolmogorov-Smlrnov StatIstics 335

distribution of the one-side.d statistics (Problem 50); Smirnov [1944] and


Dempster [1955] showed that the asymptotic distribution can be derived
from this expression (Problem 52). References for the asymptotic null dis-
tribution of JnDn and JnD: were given at the end of Sect. 4.
Tables of the exact c.dJ. of Dn are given in Birnbaum [1952]; these are
useful to find P-values. Massey [1951b] gives critical values of Dn for n = 1,
2, ... , 20, 25, 30, 35, (X = 0.01,0.05, 0.10, 0.15, 0.20, and Miller [1956] gives
critical values of D: for n = 1,2, ... , 100, C( = 0.005, O.ol, 0.025, 0.05, 0.10.
The latter tables may also be used for Dn at level2C( for 2C( S 0.10 (and at level
2C( = 0.20 with probably small error).
Each one-sample Kolmogorov-Smirnov statistic is the limit of the cor-
responding two-sample statistic as one of the sample sizes becomes infinite,
in a strong sense of convergence in probability like the strong consistency
discussed in Sect. 6 (Problem 39); for instance, Fn - Gm converges to Fn - G
as m ~ 00. Presumably, then, the large sample properties of the one-sample
procedures could be deduced from those of the two-sample procedures. In
fact, however, it may be be more straightforward to deduce the asymptotic
behavior of the two-sample statistics from that of the individual empirical
distribution functions of each of the two samples separately.
In addition to hypothesis testing, the one-sample statistics can also be
used to find confidence regions for the continuous population c.dJ. F; these
regions might well be called nonparametric. Let D:'", be the critical value of
Dn+ at level c(. Then with confidence coefficient 1 - C(, F(x) everywhere lies
on or above the lower confidence bound given by (Problem 40)
Ln(x) = max[Fn(x) - D:'"" 0]. (7.4a)
Similarly, an upper confidence bound is
(7.4b)
A two-sided confidence band with confidenc~ coefficient C( is the region
between Ln(x) and Un(x) with D:'", in (7.4a) and (7.4b) replaced by the level
C( critical value of Dn; of course, Dn~ ",/2 could also be used, but it is ordinarily
slightly larger (Problems 41 and 42). The simplest procedure in application
is to graph the observed empirical distribution function Fn as a step function
and plot parallel step functions at a distance Dn~'" (or Dn,,,,) above or (and)
below Fn as appropriate but truncate them at 0 and 1. A c.dJ. would be
"accepted" as a null hypothesis when tested by a Kolmogorov-Smirnov
test procedure if and only if it lies entirely within the corresponding band.
This confidence property is quite special. The set of all "acceptable"
c.dJ.'s is so simply describable only for tests based on statistics of the Kolmo-
gorov-Smirnov type (including certain generalizations discussed by Ander-
son and Darling [1952]). No such simple description is possible for test
statistics based on other measures of distance between c.dJ.'s, such as the
Cramer-von Mises statistics (see Problem 57). In the framework of simul-
taneous inference, one might start with confidence regions for F(x) based
336 7 Kolmogorov-Slmrnov Two-Sample Tests

on Fn(x) at each x and seek an adjustment to make them valid with overall
(simultaneous) confidence 1 - 0(. If we start with the confidence regions in
the form IF(x) - Fn(x) I : : ; c, then the two-sided Kolmogorov-Smirnov
confidence band results. If we start with IF(x) - Fn(x) I : : ; caW[F(x)] for an
arbitrary function W, such as W[F(x)] = JF(x) [1 - F(x)], then a different
band results, corresponding to the statistic sUPxIFn(x) - F(x)I/W[F(x)].
W determines the relative emphasis on short confidence intervals in different
portions of the distribution. See Anderson and Darling [1952] for further
discussion.
The Dn or Dn+ statistic also provides procedures for determining the
minimum sample size required to state with a predetermined probability
1 - 0( that if F(x) is estimated by Fn(x), the error in the estimate will nowhere
exceed a fixed value e. For instance, we can use tables of the null distribution
of Dn to find the smallest integer n such that P(Dn < e) ;:::: 1 - 0( (or P(D: < e)
;:::: 1 - O(). This is the minimum n for which the two-sided (or one-sided)
confidence bands described earlier lie within e of the empirical distribution
function.

PROBLEMS

1. Let Fm be the empirical distributIOn function of a random sample X I, ... , XIII


from a population with c.dJ. F. Show that
(a) mFm(t) is binomial with parameters m and F(t).
(b) Fm(t) has mean F(t) and variance F(t)[1 - F(t)]/m.
(c) Fm(t) is a consistent estimator of F(t) for each t.
(d) F",(t) is asymptotically normal with the mean and variance in (b).
(e) F",(t I) and F",(t 2) have covanance a(t I, t 2) = {mm[F(t I), F(t 2)] - F(t I)F(t 2) }/m.
(f) Fm(t d, ... , Fm(tk ) are asymptotically multivariate normal with the means,
variances, and covariances given in (b) and (e), for each t I, ... , tk •
*2. (a) In the situation of Problem 1, show that Fm(t) converges uniformly to F(t) in
probability, as defined in (d) of Sect. 2, or (c) below. (A proof can be based on
consistency and monotonicity alone.)
(b) Show that the result in (a) is equivalent to sup, IFm(t) - F(t) I -+ 0 in prob-
ability.
(c) Let" Zm(t)-+O uniformly in t, in probability" mean that for all 8>0, P[IZm(t) I <8
for all t] -+ 1 as m -+ 00. Let" Zm(t) -+ 0 in probability, uniformly in t" mean
that for all 8 > 0, P[IZm(t)1 < 8] -+ 1 uniformly in t as m -+ 00. Show by
example that the second condition is weaker than the first.

3. Use the probability integral transformation to show that the Kolmogorov-Smirnov


two-sample statistics are distribution-free for samples which are independently,
continuously, and identically distributed.
4. Use the algorithm of Sect. 4.1 to show that the P-value of Dmn is 0.3429 for the
sample arrangement X Y Y X X Y Y.
5. Show that the value of Dmn is always an integer divided by the least common multiple
of III and n.
Problems 337

6. In the notation of Section 4.1, show that for 0 :::; u :::; m,O :::; v :::; n, 0 :::; k :::; m + II,
no
A(m, II) = L A(u, k - u)A(m - u, II - k + u).
u=o
The range of II can be restricted to max(O, k - II) :::; II :::; minCk, Ill). Using this
expression for A(m, n) with k = (m + 11)/2 or (m + n + 1)/2 permits the recursion
to be terminated at 1/ + V = k. (See Hodgcs [1957].)
7. What change is required in the definition of A(u, v) in Section 4.1 to obtain the exact
null distribution of D,!" ?
8. Use the result of Problem 7 to find the P-value of D;:;" for the sample arrangement
X y Y X X Y Y (the same arrangement as Problem 4).
9. Show the symmetry relations (4.5) for the null distributions of the one-sided
Kolmogorov-Smirnov statistics.
10. Assume that all possible arrangements of two samples have positive probability.
Show that
(a) P(Dm" ~ c) = P(D;:;" ~ c) + P(D;;'" ~ c) - P[min(D;:;", D;;'") ~ c].
(b) D,!" and D;;," cannot both exceed 0.5.
(c) If m or n is even, samples exist with D;:;" = D;;," = 0.5.
(d) If m and II are both odd, the largest c for which mineD;:;", D;;'") ~ c is possible
and hence the largest c for which P(Dm" ~ c) < P(D;:;" ~ c) + P(D;;'" ~ c) is
c' = 0.5 - [0.5/max(m, n)].
(e) For m = 5, n = 7, 16/35 is a possible value of D;:;" but 17/35 is not. Since c' =
6/14 = 15/35 here, this illustrates that values of D;:;" between c' and 0.5 of the
form kiM, where k is an integer and M is the least common multiple of m and n,
are sometimes possible and sometimes impossible.
(f) If c is the critical value of D;:;" at exact level (1. and if c > 0.5 or m and n are both
odd and c > 0.5 - [0.5/max(m, n)], then c is the critical value of D no" at exact
level 2(1.. This is not always true if the word exact is omitted.
1l. Let PI be the larger of the two one-tailed Kolmogorov-Smirnov P-values and P2
be the two-tailed P-value. Show that 0 :::; 2P 1 - P 2 :::; 2Pt
(a) Asymptotically, that is, when the right-hand sides of (4.21) and (4.22) are used
for PI and P 2 respectively.
*(b) For III = II. (HIIlt. Show that 2P 1 - P2 :::; 2(,,~/~k)/(~:'):::; 2Pt = 2[(,,21'k)/(~:')J4,
where PI and P 2 are given by (4.7) and (48), by showing that

[1- ~J
+
[1 -2kJ : :; [1 - ~J4
III k III m
for 1 :::; k :::; Ill. See (4.14).)
12. Define" k is between c and d" as meaning c :::; k :::; d or d :::; k :::; c. In the notation
of Section 4.2, show that
(a) NU, k, l) = N( -j, -k, -I).
(b) NU, k, l) = NU, I) - N(k,j, l) if k is betweenj and 21.
(c) N(notj, k, /) = N(k,j, I) ifj is between k and 21.
(d) NU, k, /) = NU, k - I) - N(k,j, k - /) if k is betweenj and 2k - 21.
(e) N(notj, k, /) = N(k, /) - N(not k,j, k - /) if k is betweenj and 2k - 21.
(f) N(not j, k, /) = N(not j, k, k - /).
(g) N(notj, k, I) = NU, k - I) - N(not k,j, k - l) ifj is between k and 2k - 21.
(h) k is between j and 2k - 21 if and only if k is not between j and 21.
338 7 Kolmogorov-Smlrnov Two-Sample Tests

13. Derive the formula for the number of paths which satisfy D•• ;::: kin using
(a) Part (c) of Problem 12.
(b) Part(d) of Problem 12.
14. At (4.12), if the portion of the path to the left of the point where the path reaches
height k is reflected, instead of the portion to the right, what result is obtained and
how does it relate to (4.12)1
15. Let N(i,j, k, l) and N(i, notj, k, l) be the numbers of paths starting at height i, reaching
height k after, and without, respectively, having first reached heightj, and termina-
ting at height /, where the steps are diagonal as in Sect. 4.3. Give properties of N
like those given in Sect. 4.3 and Problem 12. Assume that the sum (i + / + the
number of steps) is even. (What happens if this sum is odd 1)
16. The exact null probability distribution of any of the Kolmogorov-Smirnov two-
sample statistics can always be determined by enumerating all possible arrange-
ments of the X and Y observations and computing the value of the statistic for each.
Enumerate the m
arrangements for m = n = 3. Determine the exact null distribu-
tions of Dm. and D:., both from the enumeration and from (4.7) and (4.8).
17. Show that the Mann-Whitney statistics can be expressed in terms of the empirical
distribution functions as follows:
no
U = nI G.(X,), U' = m I Fm(lj)·
.==1 J= 1

18. Define the indicator variables I k = I if the kth smallest observation in the combmed
ordered sample of m X's and II Y's is an X, and Ik = 0 otherwise, k = 1,2, ... ,
m + n = N. Show that the two-sample Kolmogorov-Smirnov statistics can be
expressed as

D,;,,, = !!-. max [f!1~


mil ISJSN N
- •,I=}Ilk]
D",,, = -N max - - Ijm Ilk
} I
mIlIS}s,v N k=l

and ifm n,
±Ik]
=

D:" =! max [j - 2
nl:S}:S2. k=l

D•• =! max
n 1 :S J :s2.
Ij-2±Ikl.
k=1

19. Show that the one-sided, two-sample Kolmogorov-Smirnov statistics can be


expressed [e.g., Steck, 1969] in terms of the X ranks r, and the Y ranks rj, where
,., is the rank of the ith smallest X in the combined sample and r~ is that of the jth
smallest Y, as
mnD,;'. = max (mr, - Ni + /I) = max (Nj - nr~)
1 s):sn

/linD,;'. = max (Ni - /II,.,) = max (nr~ - Nj + /II).


Problems 339

20. (a) Express the one-sided, two-sample Kolmogorov-Smirnov statistics in terms


of ai' ... , am + I where aJ is the number of Y's between XU-I) and X(J) for j =
2, ... , m and al and am + I are the number of Y's below X(1) and above X(m)
respectively.
(b) What is the distribution of (a " ... , am + I) under the null hypothesis?
*21. Show that for m = n the mean ofthe two-sample, one-sided Kolmogorov-Smirnov
statistic D~n under the null hypothesis is

*22. Show by elementary considerations that P(D"" ~ 1/11) = I always, and show that
(4.8) gives the same answer under the null hypothesis. (Hint· Manipulations are hke
Problem 21.)
*23. Show that the null distribution of D~" can be obtained by the following recursive
procedure. Let Vi be the smallest integer greater than or equal to (ni/m) + ne for
0:::;; i :::;; I where I = [m(l - c)] is the greatest integer in m(l - e), and N, = 1 + V"
Let Mo = 1 and

MI = (NI) _ '~I M (NI - NJ)


• L,).., i = 1, ... , I.
1 j=O 1 - J
Then

+
P(D mn ~ c) = ~
L...M, (N - NI)j(N)
. .
i=O m- 1 m
(Hint: In the representation of Sect. 4.1, the N,th step is the first to reach the bound-
ary and (i, v,) is the point reached. Altogether there are

paths to this point, of which M, reach the boundary first at the last (N,th) step and

M,(NI-J~I) I• -

at the N}h step. There are (~) paths to (m, 11), of which

M.
I
(Nm-- N)
i
I

reach the boundary first at the Njth step. For calculation, choosing m :::;; /I mini-
mizes the number of terms, and replacing by N, Ni -
1 (without changing in N)
the recursive formula for M, reduces their size somewhat and can be justified by
observing that M, is the number of paths to (i, Vi - 1) not reaching the boundary.)
See Hodges [1957], Korolyuk [1955], and Steck [1969].

24. Show that the one-sided, two-sample Kolmogorov-Smirnov statistics can be


expressed as

+
Dmn = {j i
max - - -:
I,) 11 m
X(I+I) > l(J) } , _ {i
Dmn = max - - j-:
I.J m n
l(,+1l > X(I)
}•
340 7 Kolmogorov-Smirnov Two-Sample Tests

25. Show that the confidence bounds for a shift parameter corresponding to the two-
sample Kolmogorov-Smirnov tests with critical value c (rejecting for D':;. ~ c,
D;;;. ~ c, or Dm. ~ c) are as follows. The upper bound is

where i) is the smallest integer exceeding (mjfn) - me and j, IS the smallest integer
not less than [n(i - 1)/m] + nco The lower bound is

~ax {l(j) -
I,)
X(I): ~-
m
j - 1
n
~ c} = max {l()) -
,
X(k})}

= max {l(,,, - X(I)}


i

where k, is the smallest integer not less than [mU - l)1n] + me and I, is the smallest
integer exceeding (ni/m) - nco
26. Compare the confidence bounds for a shift parameter corresponding to the two-
sample Kolmogorov-Smirnov tests (Problem 25) with critical value c ~ 1 -
[1/min(m, n)] to those corresponding to the rank sum test.

27. Show that the largest possible values of the one-sided, two-sample Kolmogorov-
Smirnov statistic D':;. with m ::;;; n and the associated tail probabilities under the
null hypothesis can be expressed as follows where k is the largest integer III n/m:
(a) P(D':;. = 1) = 1/(~) for all m ::;;; n.
(b) P(D':;. ~ 1 - l/n) = (m + 1)/(~) for m < n.
(c) P(D,;'. ~ 1 - l/m) = N/(Z,) for m ::;;; n < 2m.
(d) P(D':;. ~ 1 - i/n) = (m;:;-')f(z,) for 0 < i < n/m.
(e) P(D':;. ~ 1 - l/m) = [(m':;k) + n - k]/(~).
(f) P(D':;. ~ 1 - i/n) = [(m;:;-') + (m+~=~-I)(n - i)]/(~) for n/m < i < 2n/m.
(g) P(D':;. ~ 1 - l/m - i/n) = [(m+'::+I)+(m;:;-~11)(n - k - i)]/(Z,) forO < i < n/m.
28. Show that the asymptotic null distribution of (2mn/N)(D,;,,,)2 is exponential and
that of (4mn/N)(D':;.)2 is chi-square with 2 degrees of freedom.

*29. Show that P(D m• ~ c) and P(D':;. ~ c) are strictly smaller for discontinuous than
for continuous F for all c, 0 < c ::;;; 1, when all observations are independent with
c.d.f. F. (Hint: Consider the possibility that all observations are tied.)
30. Show that
(a) The Kolmogorov-Smirnov statistics are discontinuous functions of the
observations.
(b) The Kolmogorov-Smirnov statistics are lower-semicontinuous, where hex)
is defined as lower-semicontinuous at Xo if, for every e > 0, there is a neigh-
borhood of Xo whereJ(x) ~ J(x o) - e.
*(c) The level of any test for discontinuously distributed observations is no greater
than its level for continuously distributed observations if the acceptance region
of the test is of the form T :s; c where T is a lower-semicontinuous function of
the observations.
Problems 341

31. In the situation and terminology of Sect. 6, show that Gil - F", is a strongly con-
sistent estimator of G - F and use this to show that the three two-sample
Kolmogorov-Smirnov statistics are consistent estimators of the corresponding
population quantities.
32. Show that the one-sided and two-sided Kolmogorov-Smirnov two-sample tests
are consistent against the alternatives stated in Sect. 6.
33. Derive the asymptotic lower bound (6.2) on the power of the one-sided two-sample
Kolmogorov-Smirnov test.
34. In the situation of Sect. 6, show that
(a) The power of the one-sided Kolmogorov-Smirnov test is at most the null
probability that D;:;. ~ cm••• - 8.
(b) For large m and n, the probability in (a) is approximately
exp{ -2[max(c. - 8 Jmn/N, OW}.
(c) The two-sided Kolmogorov-Smirnov test with critical value Cm••• has power
at least
P[G.(Xo) - Fm(xo) ~ Cm•.• ] + P[G.(xo) - Fm(xo) ~ -cm•.• ].
(d) For large m and n the quantity in (c) is approximately
11>[(8 - c.JN/mn)/(J] + 11>[( -'8 - c.JN/mn)/(J]
~ 1I>[2(8Jmn/N - c.)] + 1I>[2( -8Jmn/N - c.)],
where c. is the value of A. for which (4.18) equals ()(.
*(e) Parts (c) and (d) and (6.1)-(6.4) are valid for any x o, with 8 = G(xo) - F(x o).
What choice of Xo gives the tightest lower bound in (6.3), (6.4), and part (d)?
35. Show that the Kolmogorov-Smirnov one-sample tests are consistent as stated in
Sect 7.
36. Show that the Kolmogorov-Smirnov one-sample tests are biased. (Hint: Let Z
be a function of X such that X < Z < a for X < a, X > Z > b for X > b, and
Z = X otherwise, where Fo(a) = 1 - Fo(b) = critical value of the test statistic.
An X sample rejects whenever the corresponding Z sample does, but not con-
versely.) [Massey, 1950b].
*37. Show that the Kolmogorov-Smirnov one-sample statistics are stochastically
smaller under the null hypothesis when F 0 is discontinuous than when it is con-
tinuous, and hence that the P-values and critical values that are exact for the
continuous case are conservative in the discontinuous case.
38. Show that the one-sample Kolmogorov-Smirnov statistics are distribution-free
for a sample from a population with a continuous c.dJ. Fo(x). (Hint: Let U =
Fo(X).)
39. Show that the two-sample Kolmogorov-Smirnov statistics approach the cor-
responding one-sample statistics as one sample size becomes infinite. Define the
type of convergence you use.
40. Show that the lower, upper, and two-sided confidence bands defined by the critical
values of the Kolmogorov-Smirnov one-sample statistics each have probability
at least 1 - ()( of completely covering the true c.dJ. sampled, whatever it may be.
342 7 Kolmogorov-Smlrnov Two-Sample Tests

41. Show that a one-sample Kolmogorov-Smirnov test would "accept" the c.dJ.
F 0 If and only If F 0 hes entirely within the correspondmg confidence band. Assume
the same critical value is used for F 0 discontmuous as for F0 continuous.
42. For F 0 continuous, show that
(a) The P-value of D. is twice the smaller of the two corresponding one-sided
P-values if D. ~ 0.5, and less than twice if 0 < D. < 0.5.
(b) The critical value D•. a = D:. a/ 2 if D:. a/ 2 ~ 0.5. Otherwise D•. a < D:. a/ 2 •
43. Use a symmetry argument to show that the null distribution of D;; is identical to
that of D:.
44. Show that, in the definitions (7.1)-(7.3) of the Kolmogorov-Smirnov one-sample
statistics, the supremum is always achieved for D,; but may not be for D;; or D•.
45. Show that for F 0 continuous the Kolmogorov-Smirnov one-sample statistics
defined m (7.1)-(7.3) can be written as
D: = max Win) - F O(X(I)J
15iSn

D;; = max [Fo(X(I) - (i - 1)/nJ


1:$1.$;11

46. Show that under the null hypothesis F = F 0 for F 0 continuous, the null distribution
of the one-sample Kolmogorov-Smirnov statistics can be expressed as follows,
where U I < U 2 < ... < U. are the order statistics of a sample of size n from the
uniform distribution on (0, 1).
(a) P(D: ~ c) = P[U, ~ (i/n) - e for i = 1, ... ,11]'

(c) P(D: ~ c) = n! rf
(b) P(D" ~ c) = PW/n) - e ~ U , ~ W - 1)/nJ + e for i = 1, ... , n}.

an a,,-1
Un ••• f3 f"z
Q2 a1
dU I ••• duo where a, = (ifn) - e for i > ne,

a, = 0 otherwise, and 0 ~ e ~ 1.
l -(1/.)+< fl-(2/.)+< f(1/·)+< f<
(d) P(D. ~ c) = f ... f(u l , ... , u.)dul ... duo
1-< 1-(1/n)-< (2/.)-< (i/n)-<

where f(lI l , ••• , u.) = n! for 0 < UI < ... < u. < 1 and 0 otherwise, and

rf
1/211 ~ e ~ 1.

(e) P(D. ~ c) = 11! n


-
,
••• fb' dill· ··du" where aj is defined in (c), bj =
an an - 1 al

min{[(i - l)/nJ + e, u,+ d, and 1/2n ~ e ~ I.


47. Illustrate the use of the integral in Problem 46(c) by evaluating P(D: ~ c) for all e
when II = 2. Use this result to show that the upper-tail critical value of is 0.776 D:
for n = 2, C( = 0.05.
48. Illustrate the use of the integral in Problem 46(d) or (e) by evaluating P(D" :::;; c)
for all e when II = 2. Use this result to show that the critical value of D. is 0.8419
for 11 = 2, C( = 0.05.
Problems 343

*49. Let Fₘ be the empirical distribution of a sample of size m from a population with continuous c.d.f. F. Let b be a constant, b ≥ 1.
(a) Verify that P[Fₘ(t) ≤ bF(t) for all t] = 1 − (1/b) for m = 1 and m = 2.
(b) Prove the result in (a) for all m. (Hint: Take F uniform on (0, 1). Let Z be the sample maximum. Given Z, the remainder of the sample is distributed uniformly on (0, Z). The result for m − 1 implies that P[Fₘ(t) ≤ bt for 0 < t < Z | Z] = 1 − [(m − 1)/mbZ]. The remaining requirement for the event to occur is Z ≥ 1/b.) This result is due to Daniels [1945], and a special case is given in Dempster [1959].
*50. Show that the null distribution of Dₙ⁻ can be expressed in the following form due to Birnbaum and Tingey [1951].

    P(Dₙ⁻ ≥ c) = (1 − c)ⁿ + c Σ_{j=1}^{[n(1−c)]} (n choose j)(1 − c − j/n)^{n−j}(c + j/n)^{j−1}.

(Hint: Referring to Problem 46(a), let the last i for which Uᵢ < (i/n) − c be n − j. Then exactly n − j of the U's are smaller than 1 − c − (j/n), and this has probability (n choose j)[1 − c − (j/n)]^{n−j}[c + (j/n)]^j. Furthermore, the remaining U's are conditionally uniformly distributed on the interval [1 − c − (j/n), 1] and at most k of them are in [1 − c − (j/n), 1 − c + ((k − j)/n)]. By Problem 49, this has conditional probability 1 − {(j/n)/[c + (j/n)]} = c/[c + (j/n)]. Multiplying these two probabilities gives the jth term of the sum, which is the probability that the (n − j)th order statistic is the last at which the empirical c.d.f. exceeds the upper bound. See also Chapman [1958], Dempster [1959], Dwass [1959], Pyke [1959], and Problem 52.)
51. Verify directly, from both Problem 46(c) and Problem 50, that under the null hypothesis, P(Dₙ⁻ > 0.447) = 0.10 for n = 5.
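For readers who want a numerical companion to Problems 50 and 51, here is a small sketch (our own; the function name is ours, and only the Python standard library is assumed) that evaluates the Birnbaum-Tingey sum and reproduces the value quoted in Problem 51.

```python
from math import comb, floor

def birnbaum_tingey_tail(n, c):
    """Null probability that the one-sided Kolmogorov-Smirnov statistic is at least c (Problem 50)."""
    total = (1 - c) ** n
    for j in range(1, floor(n * (1 - c)) + 1):
        total += c * comb(n, j) * (1 - c - j / n) ** (n - j) * (c + j / n) ** (j - 1)
    return total

print(round(birnbaum_tingey_tail(5, 0.447), 3))   # 0.100, as stated in Problem 51
```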
*52. Generalize the Birnbaum-Tingey formula in Problem 50 to a formula for the null probability that sup_t [Fₘ(t) − bF(t)] ≥ c for arbitrary constants c > 0 and b > 1 − c. See the references cited in Problem 50.
*53. Derive the asymptotic null distribution of Dₙ⁻ from the Birnbaum-Tingey formula in Problem 50 (Dempster [1955]).
54. Under the null hypothesis show that
(a) Dₙ⁻ is uniformly distributed on (0, 1) for n = 1.
(b) The density of Dₙ⁻ for n = 2 is

    h(x) = 1 + 2x      for 0 ≤ x ≤ 1/2,
    h(x) = 2(1 − x)    for 1/2 ≤ x ≤ 1,
    h(x) = 0           otherwise.
55. Let Fₘ be the empirical distribution function of a random sample of size m from the uniform distribution on (0, 1). Define

    Xₘ(t) = √m [Fₘ(t) − t],
    Zₘ(t) = (t + 1) Xₘ[t/(t + 1)]

for all 0 ≤ t ≤ 1.
(a) Find E[Xₘ(t)], E[Zₘ(t)], var[Xₘ(t)], var[Zₘ(t)], cov[Xₘ(t), Xₘ(u)], cov[Zₘ(t), Zₘ(u)], for all 0 ≤ t ≤ u ≤ 1 and all m.
(b) What regions for Zₘ correspond to the regions Dₘ⁺ < c and Dₘ < c for the Kolmogorov-Smirnov one-sample statistics?
56. (a) Under the null hypothesis that a sample comes from a normal population, consider the Kolmogorov-Smirnov one-sample statistics with parameters estimated by the sample mean X̄ and standard deviation s, namely, Dₙ⁺ = sup_t {Fₙ(t) − Φ[(t − X̄)/s]} and Dₙ = sup_t |Fₙ(t) − Φ[(t − X̄)/s]|, where Φ is the standard normal c.d.f. Show that their null distributions do not depend on the mean and variance of the normal population.
(b) Give an analogous result for the null hypothesis of an exponential population.
(c) Does an analogous result hold for all parametric null hypotheses?
57. Measures of the distance between Fₘ and Gₙ other than their maximum difference can also be used as distribution-free tests of the null hypothesis F = G that two continuous distributions are identical. The one called the Cramér-von Mises statistic is defined as

    ω²_{m,n} = (mn/N²) ∫_{−∞}^{∞} [Fₘ(t) − Gₙ(t)]² d[mFₘ(t) + nGₙ(t)],   where N = m + n.

(a) Prove that ω²_{m,n} is distribution-free.
(b) What is the appropriate critical region for the alternative F ≠ G?
(c) Show that ω²_{m,n} can also be expressed as Σ_{k=1}^{N} [(km/N) − Σ_{i=1}^{k} Iᵢ]²/nm, where Iₖ = 1 if the kth smallest observation in the combined ordered sample is an X, and Iₖ = 0 otherwise.
(d) Express ω²_{m,n} in terms of the ranks of the X's and Y's in the combined sample.
See Cramér [1928], von Mises [1931], Anderson and Darling [1952], and Darling [1957] for further properties of ω²_{m,n} and other related measures of the distance between Fₘ and Gₙ.
CHAPTER 8
Asymptotic Relative Efficiency

1 Introduction
In any given inference situation, many statistical procedures may be avail-
able, both parametric and non parametric. Some measure of their relative
merits is needed, especially as regards their performance or operating charac-
teristics. For instance, a comparison of the power functions of various tests
of the same (or essentially the same) hypotheses would be of interest. It is
frequently more convenient, and also more suggestive, to use a measure of
relative merit called the relative efficiency.
The relative efficiency of a procedure O₁ with respect to a procedure O₂ is defined as the ratio of sample sizes needed to achieve the same performance, namely n₂/n₁, where n₂ is the number of observations required to achieve the same performance using O₂ as can be achieved using O₁ with n₁ observations. Thus, in particular, the relative efficiency of O₁ with respect to O₂ is less than or greater than 1 according as O₂ requires fewer or more observations than O₁ to achieve the same performance.
The use of this definition poses certain problems. For one thing, the comparison procedure O₂, at least, must be defined for each possible sample size n₂. Thus it should really be regarded as a sequence of procedures, one for each sample size. The fact that n₂ is not a continuous variable poses another slight problem, because typically there will be no integer n₂ for which the performance of O₂, in some specified respect, exactly matches the performance of O₁ with n₁ observations.
The main problem, however, is that there are many ways to specify what it
means to "achieve the same performance." Each specification will produce
its own value of n2 and hence of the relative efficiency. This multiplicity of
values can confuse comparisons.


In testing, for example, one could ask that O₂ achieve the same power as O₁ for any specific alternative, or the same average power for some weighted average over a set of alternatives. The relative efficiency, sometimes called the power efficiency in this case, is thus a function of all those variables which determine power, including n₁ and α as well as the alternative dis-
tribution. Usually, however, it varies much less than the power as a function
of these variables. Consequently, the relative performance of two tests can
usually be described much more concisely in terms of relative efficiency than
in terms of power functions directly. In some cases, conveniently, the relative
efficiency is approximately constant, so that the entire comparison reduces to
a single number.
Sometimes the relative efficiency has a lower bound close to one every-
where in the range of importance. If, for example, the relative efficiency of a
non parametric test with respect to a parametric test is never much below 1,
the non parametric test is wasting at most some small fraction of the ob-
servations. One might then feel that the advantages, like simplicity and
broader applicability, of the non parametric test outweigh this small waste.
Such a conclusion should not be reached casually, however, without serious
assessment of the real value of such advantages and of increasing power.
For point estimation, relative efficiency is usually defined as the ratio of
sample sizes needed to achieve the same variances or mean squared errors
of the estimators. Other functions of the error (other" loss" functions) could
be used instead of the square for matching. Matching one function does not
ordinarily match others or the error distributions exactly. The relative
efficiency of two estimators also depends on the assumed distribution and
nl' just as for tests it depends on nl and the alternative where the power is
matched.
For confidence intervals, relative efficiency might be defined by matching
expected lengths. This would not entirely match the distributions of the
endpoints, of course. The probability of covering some specified false value
could be matched instead, and the result would then be a function of the
false value used. In any case, the relative efficiency of two confidence pro-
cedures will depend on the confidence level, as well as the true distribution
and n l •
Since relative efficiency generally depends on so many factors, its implica-
tions may be difficult to assess and interpret. This problem often disappears
conveniently when limits are taken. The asymptotic relative efficiency of a
procedure O₁ with respect to a procedure O₂ is defined roughly as the limit of the relative efficiency as n₁ → ∞. Here, of course, both O₁ and O₂ must be sequences of procedures, defined for arbitrarily large sample sizes n₁ and n₂. It would seem that the limit required in this definition might well
fail to exist, and that when it does exist it might depend on essentially the
same variables as the relative efficiencies for finite sample sizes. We shall see,
however, that things become much simpler as the sample sizes approach
infinity, and a single number will describe a great many features of the
asymptotic relative behavior of two procedures. Because of this very great
convenience, asymptotic relative efficiency is widely used for comparisons
of procedures, even though it is only a large-sample property.
This chapter is devoted to a study of asymptotic relative efficiency. We
first investigate heuristically what typically happens as the sample sizes
approach infinity in the case of tests, point estimates and confidence bounds
(Sects. 2-4 respectively, followed by an example in Sect. 5). We will then be
in a position to list the specific properties of asymptotic relative efficiency
(Sect. 6) and explain how it is calculated (Sect. 7). The theory will then be
illustrated (Sect. 8) by applying the results to the one-sample procedures
which were discussed in Chaps. 2-4. Specifically, we will derive numerical
values of the asymptotic relative efficiency for various pairs of procedures
for some selected families of distributions. Thereafter we will discuss matched
pairs (Sect. 9), the two-sample shift problem (Sect. 10), and finally the
Kolmogorov-Smirnov procedures (Sect. 11), which behave differently from
the others and pose different technical problems.

2 Asymptotic Behavior of Tests: Heuristic Discussion


The object of this section is to explore, without worrying about rigor, how
typical tests behave in large samples. We consider the power of an individual
test as well as the relative performance of two tests. The development will
suggest various natural definitions of asymptotic efficiency of one test rela-
tive to another, and will show that we can expect a certain formula to apply
to these definitions, and also to the corresponding point estimators and
confidence bounds. The calculations throughout will be approximate,
but the approximate statements should become exact in the limit as the
sample sizes approach infinity. The discussion will be heuristic, with no
formulations or proofs of any precise statements about the limits which
correspond to the approximations presented.

2.1 Asymptotic Power of a Test

In order to investigate the asymptotic relative efficiency of two tests, it is


necessary first to learn something about the asymptotic behavior of each
test individually. Accordingly, this subsection is concerned with approxima-
ting the power of a single test in large samples. Consider a one- or two-sided test based on a statistic Tₙ, which rejects for large values of Tₙ, small values of Tₙ, or both, as the case may be. (The subscript n is included to indicate the dependence on the sample size.) The power depends on the alternative distribution under discussion, and we shall need to consider more than one alternative at a time. Let us consider a family of distributions depending
on a one-dimensional parameter θ, where θ₀ denotes a distribution belonging to the null hypothesis. For instance, we might be interested in the behavior of the ordinary sign test against normal alternatives. Then Tₙ could be defined as the number of negative observations, θ as the true mean, and θ₀ as 0, with the true standard deviation fixed. (Alternatively, θ could be defined as the mean divided by the standard deviation, since that is all that matters here.)
Suppose that the test statistic Tₙ is approximately normal with mean μₙ(θ) and standard deviation σₙ(θ) when the true distribution is given by θ. In particular, then, under the null hypothesis, Tₙ is approximately normal with mean μₙ(θ₀) and standard deviation σₙ(θ₀). Accordingly, the rejection region of an upper-tailed test based on Tₙ is approximately

    Tₙ ≥ μₙ(θ₀) + z_α σₙ(θ₀),                                   (2.1)

where z_α is the upper α point of the standard normal distribution, the number such that 1 − Φ(z_α) = α for Φ the standard normal c.d.f. (The development for lower-tailed and two-tailed tests is similar and is left as Problem 1.) The power of this test, at any alternative θ, is the probability of rejection, which is approximately the probability of (2.1). Since the test statistic Tₙ is approximately normal with mean μₙ(θ) and standard deviation σₙ(θ), the probability of (2.1) under θ is approximately the probability that a standard normal random variable exceeds the value

    [μₙ(θ₀) + z_α σₙ(θ₀) − μₙ(θ)] / σₙ(θ).                      (2.2)

In terms of the standard normal c.d.f. Φ, this probability can be written

    Φ{[μₙ(θ) − μₙ(θ₀)]/σₙ(θ) − z_α σₙ(θ₀)/σₙ(θ)}.               (2.3)

Thus the power of the test against θ is given approximately by (2.3).
Now we are ordinarily interested in alternatives against which the test is consistent, so that the power approaches one as n → ∞ for θ fixed. (This presumably restricts θ to one side of θ₀ for a one-tailed test.) Then the approximation (2.3) for fixed θ says only that the power is approximately one. It is therefore appropriate to consider not a particular alternative θ, but a sequence of alternatives θₙ which approach θ₀ as n → ∞. Now if μₙ is differentiable at θ₀ and σₙ is continuous, then for θₙ near θ₀ we have the approximations

    μₙ(θₙ) ≅ μₙ(θ₀) + (θₙ − θ₀) μₙ'(θ₀),    σₙ(θₙ) ≅ σₙ(θ₀),    (2.4)

where μₙ'(θ₀) = dμₙ(θ)/dθ|_{θ=θ₀}. Substituting (2.4) into (2.3), we find that the power of the test against θₙ is approximately equal to

    Φ[(θₙ − θ₀) μₙ'(θ₀)/σₙ(θ₀) − z_α].                          (2.5)


In particular, the rate at which θₙ → θ₀ can be adjusted so that the argument of Φ is finite. Furthermore, to a first approximation, the power function of the test is completely determined by the quantity μₙ'(θ₀)/σₙ(θ₀). Notice that the variance σₙ²(θ) is needed only under the null hypothesis θ = θ₀.
As an example, consider the power function of the ordinary sign test against normal alternatives. Let Tₙ be the number of negative observations. If the true distribution of the population is normal with mean θ and variance σ², then the probability of a negative observation is

    p = Φ(−θ/σ).                                                (2.6)

At θ = θ₀ = 0, the null hypothesis p = 1/2 is satisfied. Tₙ is binomial with parameters p and n, and hence is approximately normal with mean and variance given by

    μₙ(θ) = np = nΦ(−θ/σ)                                       (2.7)
    σₙ²(θ) = np(1 − p) = nΦ(−θ/σ)[1 − Φ(−θ/σ)].                 (2.8)

Letting φ denote the standard normal density, we calculate

    μₙ'(θ₀)/σₙ(θ₀) = n(−1/σ)φ(0)/√(n/4) = −2√n φ(0)/σ = −√(2n/π)/σ.   (2.9)

Substituting (2.9) in (2.5) then gives

    Φ[−√(2n/π)(θₙ − θ₀)/σ − z_α]

as an approximation to the power function of the sign test against normal alternatives.
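To see how well this kind of approximation behaves at a particular sample size, the following sketch (our own; the sample size, level, and alternative are arbitrary choices, and SciPy is assumed available) compares the approximate power just displayed with the exact binomial power of the corresponding sign test.

```python
from math import sqrt, pi, ceil
from scipy.stats import norm, binom

n, alpha, sigma = 100, 0.05, 1.0
z = norm.ppf(1 - alpha)
theta = -0.2                        # alternative mean; the test is directed against theta < 0

# Rejection region of the form (2.1): too many negatives, T_n >= n/2 + z * sqrt(n)/2
k = ceil(n / 2 + z * sqrt(n) / 2)
p = norm.cdf(-theta / sigma)        # probability of a negative observation under the alternative
exact_power = binom.sf(k - 1, n, p)

# Normal approximation displayed above: Phi[-sqrt(2n/pi)(theta - theta0)/sigma - z_alpha]
approx_power = norm.cdf(-sqrt(2 * n / pi) * theta / sigma - z)

print(round(exact_power, 3), round(approx_power, 3))   # the two values should be close
```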
In this example, μₙ'(θ₀)/σₙ(θ₀) is of order √n. This is typical, as we shall see from other examples. For purposes of relative efficiency, it would be convenient to have a quantity of order n. Accordingly, we introduce the square of the foregoing ratio, namely

    eₙ = [μₙ'(θ₀)]²/σₙ²(θ₀).                                    (2.10)

The quantity eₙ is, in general, called the efficacy of the test statistic Tₙ for the family of distributions in question (at θ₀). As defined here, it depends on the choice of the approximations μₙ and σₙ, which are not unique, and on the choice among equivalent test statistics. As we shall see, however, these choices have only a second-order effect, and the efficacy of a test is uniquely defined to a first order of approximation.
In taking the square, the sign of μₙ'(θ₀)/σₙ(θ₀) is lost, but this is unimportant for present purposes. The sign is always the same as the sign of μₙ'(θ₀) and merely indicates whether large values of Tₙ are associated with large or small values of θ. For example, the negative sign in (2.9) corresponds to the fact that for normal distributions the number of negative observations tends to be large when the mean θ is small, which implies that a one-tailed test rejecting when there are too many negative observations is appropriate against
the alternative θ < 0, not the alternative θ > 0. Provided the appropriate one-tailed test is used for a one-sided alternative, the sign of μₙ'(θ₀) is of no consequence.
The efficacy eₙ therefore contains exactly the information we need here, and conveniently is typically of order n. In terms of eₙ, by (2.5), the power of the appropriate one-tailed test against the alternative θ > θ₀ can be expressed approximately as

    Φ[(θ − θ₀)√eₙ − z_α].                                       (2.11)

Similar expressions apply to the alternatives θ < θ₀ and θ ≠ θ₀.


This indicates that, in large samples, the power functions of all typical tests are approximately the same except for a scale factor √eₙ which depends on the test statistic and the family of alternative distributions. Since eₙ → ∞, a graph of the approximate power (2.11) as a function of θ without attention to scale would show for large n only that the power is approximately one for the entire alternative hypothesis θ > θ₀. This would tell us nothing, except that the test is consistent. However, if we rescale θ according to sample size, using δₙ = (θ − θ₀)√eₙ for sample size n, and plot the power as a function of δₙ, then we get approximately Φ(δₙ − z_α) and we can see what the power function really looks like as n → ∞. This function appears in Fig. 2.1 for z_α = 1.64, as the curve labeled k = 1. It describes the large-sample power, suitably rescaled, for all typical one-tailed tests with α = 0.05.

Figure 2.1

We are, of course, not recommending formulas (2.5) and (2.11) as good


approximations for actual calculation of the power. Using the normal
approximation leads naturally to formula (2.3), but it is seldom necessary
to make the additional approximation in (2.4) which leads to (2.5) and (2.11).
Equations (2.5) and (2.11) are introduced here primarily to show how the
power function behaves as n → ∞. This facilitates an intuitive understanding
of the situation and will also be useful later.
We may also mention that, if the exact mean and variance are used in
the definition (2.10), the efficacy is less than or equal to the Fisher informa-
tion, and that this inequality is a form of the famous Cramer-Rao inequality
(Problem 2). The asymptotic efficiency of one-parameter maximum likeli-
hood estimators follows heuristically (Problem 3).

2.2 Nuisance Parameters

The problem of nuisance parameters will be introduced by means of an


example. Consider a situation where the test statistic Tₙ is the sample mean, and the population is normal with known variance σ². If the null hypothesis is that the mean θ has a specified value θ₀, then the approximations of the previous subsection are exact with

    μₙ(θ) = θ,    σₙ²(θ) = σ²/n.                                (2.12)

The normal approximation is exact because Tₙ is exactly normal, and (2.4), (2.5) and (2.11) are exact because in addition μₙ(θ) is exactly linear and σₙ(θ) is exactly constant.
If the assumption of normality under the null hypothesis is eliminated, the only change is that the level of the test becomes approximate. That is, for the null hypothesis that the population has mean θ₀ and known variance σ², the same test has level approximately α and its power against normal alternatives with variance σ² is the same as before.
Now suppose we drop the assumption that σ² is known under the null hypothesis. In this case, the sample mean itself cannot serve as a test statistic, even for an approximate test, because its distribution under the null hypothesis depends on σ², and this dependence does not disappear as n → ∞. Thus σ² may be called an unknown "nuisance parameter."
In this situation, the normal-theory test statistic is the t statistic, which is approximately normal with mean μₙ(θ) and variance σₙ²(θ₀), where

    μₙ(θ) = θ/σ,    σₙ²(θ₀) = 1/n                               (2.13)

(Problem 4). Again we need σₙ²(θ) only at θ₀. These are not the exact mean and variance of the t statistic, but the distribution of the t statistic can be approximated by a normal distribution with this mean and variance. (This is true whether or not the population is normal.) As a result, the efficacy is the same as before.
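As a quick check of that last remark (our own elaboration, using only the formulas already given), substituting the approximations (2.13) into the definition (2.10) of the efficacy gives

    eₙ = [μₙ'(θ₀)]²/σₙ²(θ₀) = (1/σ)²/(1/n) = n/σ²,

which is the same efficacy as is obtained from (2.12) for the sample mean with σ known.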
The following approach to the problem under discussion is easier to


generalize, and also is perhaps more natural when normality is not assumed.
Suppose that the critical value of a test based on the sample mean depends on an unknown nuisance parameter σ, and we use an estimate of σ as if it were the true value. As long as this estimator is consistent, the test has approximately the same rejection region as the test with σ known, and hence its size and power are approximately the same. Therefore, the efficacy calculated before still applies for purposes of approximate power.
The t test can be regarded as arising in this way but having its critical value adjusted to make the level exact under normality. Specifically, an upper-tailed test based on the sample mean Tₙ would reject for

    Tₙ > θ₀ + z_α σ/√n.                                         (2.14)

If the sample standard deviation S is substituted for σ when σ is unknown, the resulting test rejects for

    Tₙ > θ₀ + z_α S/√n.                                         (2.15)

The previous paragraph states that the test (2.15) is approximately the same as the test (2.14) which requires the true value of σ, and this holds whatever that value may be. The t test is of the form (2.15) except that the constant z_α is adjusted to make the test exact under normality. The amount of the adjustment approaches zero as n → ∞, however.
This illustrates the typical situation where a statistic Tₙ is selected for a test but the null distribution of Tₙ depends on one or more nuisance parameters. Then we cannot base the test on Tₙ in the sense of comparing Tₙ with a fixed critical value. We can in another sense, however, as follows. Compute the critical value of Tₙ as a function of the nuisance parameters, substitute consistent estimates of the nuisance parameters, and compare Tₙ with the resulting critical value. Then the test will be approximately the same as a test which is based on Tₙ in the ordinary sense but requires knowledge of the values of the nuisance parameters. In particular, the efficacy computed for Tₙ using (2.10) will relate in the usual way to the power of the test. For instance, Equation (2.11) will apply (Problem 6a). This will still be true if the test is adjusted further to make its level exactly α under some null hypothesis (Problem 6b). In short, the efficacy of Tₙ will typically apply to the power of all tests which are based essentially on Tₙ.
There is an alternative argument that can often be used to justify the approximations of the previous paragraph. For example, since the t test at level α = 0.50 can be based on the sample mean, the efficacy computed from the sample mean applies to the t test at level α = 0.50. However, since the efficacy does not depend on the value of α, the efficacy of the sample mean must apply to the t test at all levels. This argument is further explained and used in connection with point estimation in Sect. 3.2.

2.3 Asymptotic Relative Efficiency of Two Tests

Now consider two tests at the same level, which are based on statistics T_{1,n} and T_{2,n} with efficacies e_{1,n} and e_{2,n} respectively. The power of each test is given approximately by (2.11). For a given sample size, the two power functions are then approximately the same shape but they differ by a scale factor √(e_{2,n}/e_{1,n}). That is, the power of the first test at θ₀ + δ is approximately the same as the power of the second test at θ₀ + δ√(e_{1,n}/e_{2,n}). Figure 2.1 illustrates this situation with graphs of normal power functions for one-sided tests at level α = 0.05, that is, Φ(δ√k − 1.64) as a function of δ, for several values of k, where k represents e_{2,n}/e_{1,n}.
We have seen that typically e_{1,n} is of order n. More specifically, as n → ∞, the ratio e_{1,n}/n typically approaches¹ some positive constant e₁,

    lim e_{1,n}/n = e₁.                                         (2.16)

We call e₁ the limiting efficacy per observation or, more briefly, the asymptotic efficacy of the first test statistic. If the limiting efficacies per observation exist for both test statistics, then the ratio e_{1,n}/e_{2,n} approaches E_{1:2}, where

    E_{1:2} = lim (e_{1,n}/n) / lim (e_{2,n}/n).                (2.17)

Thus, as a large-sample approximation, we may say that, in samples of the same size, tests at the same level based on T_{1,n} and T_{2,n} will have power functions differing by a scale factor 1/√E_{1:2}. Formula (2.17) is frequently called Pitman's formula, and will be repeated later in Sect. 7.
This scale-factor relationship is an important property of the quantity E_{1:2}. In all situations which we discuss, this property occurs together with an even more fundamental property, which is essentially the customary definition of asymptotic relative efficiency and which we now develop.
Let us consider the same two tests and ask when the first test with sample size n₁ will have the same power as the second test with sample size n₂. The power of each test is given approximately by (2.11). From this expression it is clear that the two tests will have approximately the same power against any alternative θ when the scale factors √(e_{1,n₁}) and √(e_{2,n₂}) are approximately equal, or

    e_{1,n₁} ≅ e_{2,n₂}.                                         (2.18)

¹ All limits in this chapter are to be taken as the relevant sample sizes approach infinity, but this will not be stated explicitly.


Since (2.16) implies that e_{1,n₁} ≅ n₁e₁ and e_{2,n₂} ≅ n₂e₂, substituting these in (2.18) shows that the power functions of the two tests will be equal when

    n₁e₁ ≅ n₂e₂,

and hence when n₁ and n₂ are in the ratio

    n₂/n₁ ≅ e₁/e₂ = E_{1:2}.                                     (2.19)

Thus E_{1:2} can be interpreted as the limiting ratio of sample sizes for which the tests have the same power. This is the usual definition of asymptotic relative efficiency. The foregoing discussion indicates that both this and the scale-factor interpretation of E_{1:2} mentioned earlier will hold in typical situations.
As an example, consider the asymptotic efficiency of the ordinary sign test relative to the classical normal-theory test for the same situation. The asymptotic relative efficiency depends on the family of distributions under discussion. Let us consider a normal family as one relevant possibility. Let T_{1,n} be the test statistic for the sign test, that is, the number of negative observations. From the results given in (2.9), the efficacy of T_{1,n} for the normal distribution is

    e_{1,n} = 2n/(πσ²).                                          (2.20)

The normal-theory test is based essentially on the sample mean. The test statistic is the sample mean if σ is known and the t statistic if σ is unknown, but, as explained in the last subsection, we may proceed as if the sample mean were the test statistic in both cases. Accordingly, we let T_{2,n} be the sample mean and obtain its efficacy from (2.12) as

    e_{2,n} = n/σ².                                              (2.21)

The asymptotic efficiency of the sign test relative to the normal-theory test (for σ either known or unknown), against normal alternatives, is then the ratio

    E_{1:2} = lim e_{1,n}/e_{2,n} = 2/π = 0.64.                   (2.22)

We have given two interpretations of this result, both applying to large samples. First, the power of the sign test against a normal alternative with mean θ = δ is approximately the same as the power of the normal-theory test against θ = δ√(2/π) = 0.80δ, when the sample size and δ are the same for both tests. Second, the sign test has approximately the same power against normal alternatives as the normal-theory test based on 64% as many observations.
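The second interpretation lends itself to a quick simulation check. The sketch below is our own illustration, not from the text; the shift, sample sizes, and replication count are arbitrary, and NumPy and SciPy are assumed available. It estimates the power of a sign test based on 150 observations and of a t test based on roughly 64% as many observations against the same normal alternative.

```python
import numpy as np
from scipy.stats import binom, t as t_dist

rng = np.random.default_rng(1)
alpha, shift, sigma = 0.05, 0.3, 1.0
n_sign, n_t, reps = 150, 96, 20_000        # 96 is about 0.64 * 150

# Sign test of H0: mean 0 against mean > 0 -- reject when there are too few negative observations
x = rng.normal(shift, sigma, size=(reps, n_sign))
neg = (x < 0).sum(axis=1)
crit = binom.ppf(alpha, n_sign, 0.5) - 1    # conservative lower-tail critical value
power_sign = np.mean(neg <= crit)

# One-sample t test on the smaller sample
y = rng.normal(shift, sigma, size=(reps, n_t))
t_stat = y.mean(axis=1) / (y.std(axis=1, ddof=1) / np.sqrt(n_t))
power_t = np.mean(t_stat > t_dist.ppf(1 - alpha, df=n_t - 1))

print(round(power_sign, 3), round(power_t, 3))   # the two estimated powers should be roughly comparable
```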

3 Asymptotic Behavior of Point Estimators: Heuristic Discussion
In Sect. 2 we explored heuristically the individual and relative performance
of typical tests in large samples. A similar discussion for point estimators
is the first subject of this section. We shall see that the asymptotic relative
efficiency of two estimators is the same as that of two tests based on those
estimators. Most test statistics, however, are not themselves estimators of
natural parameters, although they are often derived from such estimators.
We shall, therefore, give a general method of obtaining estimators from tests
and thereby justify the conclusion that the asymptotic relative efficiencies
of tests apply also to suitably related estimators. Finally we shall consider
how one might compare two estimators of different quantities. This latter
discussion is not really necessary for later purposes, but will provide addi-
tional insight into the situation. As in Sect. 2, the approximate statements
become exact as the sample sizes approach infinity, but we do not give a
precise formulation of limit statements in this section.

3.1 Estimators of the Same Quantity

Suppose we are considering two estimators of the same quantity, say μ. For example, the sample median and the sample mean may be regarded as estimators of the same quantity if the population median and mean are assumed to be equal. Let the two estimators be T_{1,n} and T_{2,n}, and suppose that the true distribution belongs to some specified family of distributions indexed by a one-dimensional parameter θ. Since θ determines the true distribution, it also determines the quantity being estimated, which we may write accordingly as μ(θ).
Suppose that T_{1,n} and T_{2,n} are approximately normal, as estimators usually are, with mean μ(θ) and variances σ²_{1,n}(θ) and σ²_{2,n}(θ) respectively. Now if the definitions of Sect. 2 are applied to T_{1,n} and T_{2,n} regarded as test statistics for a particular null value of θ (which they could be at least within our one-parameter family of distributions), then their efficacies and their asymptotic relative efficiency are given by

    e_{1,n}(θ) = [μ'(θ)]²/σ²_{1,n}(θ),    e_{2,n}(θ) = [μ'(θ)]²/σ²_{2,n}(θ),    (3.1)

    E_{1:2}(θ) = lim σ²_{2,n}(θ)/σ²_{1,n}(θ).                                    (3.2)

Note that since μ'(θ) is the same for both tests, E_{1:2}(θ) can be obtained without actually computing μ'(θ).
The two interpretations of E_{1:2}(θ) in estimation are similar to the two in testing. First, the errors of the two estimators T_{1,n} and T_{2,n} have approximately the same distribution except for a scale factor 1/√E_{1:2}(θ), since both
estimators are approximately normal with mean μ(θ) and the ratio of the variance of the first estimator to that of the second is 1/E_{1:2}(θ).
Second, if the two estimators are based on samples of sizes n₁ and n₂ in the ratio n₂/n₁ = E_{1:2}(θ), then the distributions of the estimators and hence of the errors will be approximately the same. To see this, recall again that both are approximately normal with mean μ(θ). They will, therefore, have approximately the same distribution if their variances are approximately equal, and hence if their efficacies, given by (3.1), are approximately equal. But this is exactly the condition of Equation (2.19) for the tests based on T_{1,n₁} and T_{2,n₂} to have approximately the same power function, except that the dependence on θ₀ was suppressed there while the dependence on θ is not suppressed here. Accordingly, by the same argument as in Sect. 2.3, if the ratio n₂/n₁ is, in the limit, equal to E_{1:2}(θ), then the two estimators T_{1,n₁} and T_{2,n₂} will, in the limit, have the same distribution. Note that in general, this ratio depends on the true value of θ, although this dependence will disappear in our examples.
In typical situations, then, E_{1:2}(θ) will have the two interpretations above and can be computed for estimators exactly as for test statistics when the estimators are estimating the same quantity.
As an example, suppose μ is the proportion negative in some population, T_{1,n} is the proportion of negative observations in the sample, and T_{2,n} = Φ(−X̄/S), where Φ is the standard normal c.d.f. and X̄ and S are the sample mean and standard deviation. The second estimator is appropriate for normal populations with unknown standard deviation. If the population is normal with mean θ and standard deviation σ, then the asymptotic efficiency of the first estimator relative to the second can be computed (Problem 7) as

    E_{1:2}(θ) = (1 + ½w²)φ²(w) / [p(1 − p)]                     (3.3)

where w = −θ/σ, p = Φ(w), and φ is the standard normal density function. E_{1:2}(θ) is plotted as a function of p in Fig. 3.1. When 25% of the population is negative, for example, E_{1:2}(θ) = 0.66. This implies that the variance of the sample proportion negative is approximately 1/0.66 = 1.51 times the variance of Φ(−X̄/S), and that their errors have approximately the same distribution except for a scale factor 1/√0.66 = 1.23. It also implies that the sample proportion negative has approximately the same distribution as Φ(−X̄/S) based on 66% as many observations. If θ = 0, then p = 0.5 and the asymptotic relative efficiency is the same as for the sign test relative to the t test, namely 2/π = 0.64. As can be seen from Fig. 3.1, the normal-theory estimator, Φ(−X̄/S), is always better under normality and may be much better. Unfortunately, however, if the normality assumption fails, then as discussed in Sect. 3.2 of Chap. 2, Φ(−X̄/S) will not ordinarily be a natural or good estimator of the proportion negative at all.
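Formula (3.3) is easy to evaluate numerically. The short sketch below (ours; SciPy is assumed, and the function name is our own) reproduces the two values cited in the text, 0.66 at p = 0.25 and 2/π ≈ 0.64 at p = 0.5.

```python
from math import pi
from scipy.stats import norm

def are_proportion_vs_normal_theory(p):
    """Formula (3.3): asymptotic efficiency of the sample proportion negative
    relative to Phi(-Xbar/S), when the population proportion negative is p."""
    w = norm.ppf(p)
    return (1 + 0.5 * w ** 2) * norm.pdf(w) ** 2 / (p * (1 - p))

print(round(are_proportion_vs_normal_theory(0.25), 2))   # 0.66
print(round(are_proportion_vs_normal_theory(0.50), 2))   # 0.64, i.e., 2/pi
print(round(2 / pi, 2))
```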

Figure 3.1

3.2 Relation of Estimators and Tests

The previous subsection makes it appear that estimators have the same
asymptotic relative efficiency as tests. However, more needs to be said con-
cerning which estimators have the same relative efficiency as which tests.
When we start with tests and look for corresponding estimators, the
following difficulty arises. A test can be defined equivalently in terms of
many different test statistics. However, if the asymptotic relative efficiency
of two tests is to apply to their test statistics when regarded as estimators,
then both test statistics must be estimators of the same population quantity,
which must not depend on n. Consider, for instance, the sign test and the
normal-theory test in a one-sample problem. The sign test was defined
earlier in terms of the number of negative observations, but this estimates
a quantity depending on n. An equivalent test statistic which eliminates this
dependence is the proportion of negative sample observations, which
estimates the proportion negative in the population. However, the normal-
theory test statistic, either the sample mean or the t statistic, is not an
estimator of this parameter, and so we still do not have a pair of estimators
which are related in the manner of the previous section and correspond to
the two tests, the sign test and the normal-theory test.
If, on the other hand, we start with estimators, the difficulty is that they
may not appear to be test statistics for the right kind of test. The sample
358 8 Asymptolic RelatIve EfficIency

median, for instance, is not the test statistic for the sign test. Furthermore,
since its distribution depends on the shape of the population, it is a possible
test statistic only under very restrictive assumptions. How, then, can its
efficacy in estimation be related to the efficacy of any interesting test?
One answer is to consider one-tailed tests at the level α = 0.50. For example, the upper-tailed sign test at level α = 0.50 rejects if the number of negative observations exceeds n/2. This is equivalent to rejecting if the sample median is negative. Thus the sample median can serve as the test statistic for a one-tailed sign test at level α = 0.50, although not at other levels. As we have seen, however, the efficacy of a test does not depend on the level. Therefore, the efficacy of the sample median must be asymptotically the same as the efficacy of the sign test (Problem 8). While we have been talking about the sign test for the null hypothesis that the population median is 0, the argument extends immediately to any hypothesized value of the median. A sign test of the null hypothesis that the median is μ₀ is ordinarily based on the number of observations less than μ₀, but the sample median can again serve as the test statistic for a one-tailed test at level α = 0.50. Thus the efficacy of the sample median as an estimator of the population median in any family of distributions will be asymptotically the same at each value of the population median as the efficacy of the sign test for that value of the population median. Of course, this can be verified directly
(Problem 9). However, the foregoing argument relates the sample median
to the sign test explicitly and shows that their efficacies must be asymptotically
equal and need not both be computed individually.
A similar argument relates the sample mean to the t test (Problem 10).
(Section 2.2 presented another argument, namely that the t test at any level
is asymptotically equivalent for present purposes to a test based on the sample
mean.)
To generalize the argument relating a family of tests for a parameter μ to an estimator of μ with asymptotically the same efficacy, suppose we perform a one-tailed test at level α = 0.50 for each value of μ. Then for any given set of observations there is usually one value of μ such that all values on one side of it would be accepted, and all values on the other side rejected. This point of division, the value of μ which would be "just accepted" at the level α = 0.50, may be considered an estimator of μ corresponding to the family of tests in question. It is just the confidence bound for μ at level 1 − α = 0.50 which corresponds to the family of tests at level α = 0.50. Its efficacy might be very difficult to obtain directly. However, since this estimator could be used as a test statistic for any one of the tests at level α = 0.50, its efficacy at any value of μ must be asymptotically the same as that of the test for that value of μ.
In summary, the foregoing argument shows that, given a family of tests for μ, the corresponding 50% confidence bound is a naturally related estimator of μ, with asymptotically the same efficacy.

Estimators Derived from One-Sample Tests

We now apply the foregoing method to derive estimators from one-sample tests when the population is assumed symmetric about some point μ. Earlier chapters discussed many tests of the null hypothesis μ = 0, including the sign test and other signed-rank tests. Recall that any such test generates a family of tests, one for each value of μ, as follows. To test a null value of μ, subtract that number from every observation, compute the test statistic, and compare it with an appropriate quantile of its null distribution as critical value. For a one-tailed test at level α = 0.50 this quantile is the median. Comparing the test statistic (computed after subtracting μ) with the median of its null distribution is equivalent to comparing μ with the amount that must be subtracted from every observation to make the test statistic equal to the median of its null distribution. Therefore, an estimator of μ corresponding to such a family of tests is the amount that must be subtracted from every observation to make the test statistic equal to the median of its null distribution.
In the case of the one-tailed sign test, for example, a test statistic is the number of negative observations. Its null distribution has median n/2, barring complications due to discreteness. If we ignore the problems of zeros, ties, and whether n is even or odd, the amount that must be subtracted from every observation to make the number of negative observations equal to n/2 is the sample median. As we have already seen, a one-tailed sign test at level α = 0.50 for any μ can be carried out by comparing the sample median with μ.
In the case of the Wilcoxon signed-rank test, it is not quite so obvious what the related estimator is. It is obvious, however, that an estimator can be defined as the amount that must be subtracted from every observation to make the test statistic equal to the median of its null distribution. And of course, a one-tailed Wilcoxon signed-rank test at level α = 0.50 for any center of symmetry μ can be carried out by comparing this estimate with μ. It is interesting to note that the estimate is actually the median of the set of all Walsh averages, that is, the median of the set of n(n + 1)/2 numbers of the form (Xᵢ + Xⱼ)/2 for i ≤ j (Problem 11).
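The Walsh-average characterization is easy to check on a small data set. The sketch below is ours (the data are arbitrary and NumPy is assumed); it computes the median of the Walsh averages and verifies that subtracting it from every observation brings the signed-rank sum to the median of its null distribution, n(n + 1)/4.

```python
import numpy as np

def signed_rank_sum(x):
    """Wilcoxon signed-rank statistic: sum of the ranks of |x| belonging to positive x."""
    ranks = np.argsort(np.argsort(np.abs(x))) + 1
    return ranks[x > 0].sum()

x = np.array([1.3, -0.4, 2.1, 0.7, -1.6, 3.0, 0.2])          # n = 7, so n(n+1)/2 = 28 Walsh averages
n = len(x)

walsh = np.array([(x[i] + x[j]) / 2 for i in range(n) for j in range(i, n)])
estimate = np.median(walsh)                                   # median of all Walsh averages

print(estimate)
print(signed_rank_sum(x - estimate), n * (n + 1) / 4)         # both equal 14 for these data
```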
Other one-sample tests give rise similarly to estimators of an assumed
center of symmetry μ.

Estimators Derived from Two-Sample Tests

In the two-sample shift problem, two-sample tests give rise to estimators of the shift parameter μ in much the same way. Specifically, the estimator is the amount which must be subtracted from every observation of the second sample to make the test statistic equal to the median of its null distribution. For any μ, a one-tailed test at level α = 0.50 of the null hypothesis that μ is the amount of the shift by which the populations differ can be performed by comparing this estimate with μ. The estimator corresponding to the two-sample t test in this way is the difference of the sample means. Similarly, the difference of the sample medians corresponds to the two-sample median test, and the median of the set of all differences Yⱼ − Xᵢ corresponds to the two-sample rank-sum test (Problem 12).
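A parallel sketch for the two-sample case (again ours, with arbitrary data and NumPy assumed) computes the three estimators just described.

```python
import numpy as np

x = np.array([0.8, 1.9, -0.3, 1.1, 0.5])            # first sample
y = np.array([2.0, 1.4, 3.1, 0.9, 2.6, 1.8])        # second sample

differences = (y[:, None] - x[None, :]).ravel()      # all mn differences Y_j - X_i

print(round(y.mean() - x.mean(), 3))                 # corresponds to the two-sample t test
print(round(np.median(y) - np.median(x), 3))         # corresponds to the two-sample median test
print(round(np.median(differences), 3))              # corresponds to the two-sample rank-sum test
```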

Summary

The foregoing discussion shows how to find estimators which correspond


to certain tests even when they are not obvious. This link is illuminat-
ing in itself. In addition, though not essential to either the definition
or the computation of the asymptotic relative efficiency of estimators, it
eliminates the need for such computations if the efficiency of the corres-
ponding tests has already been computed. This is especially advantageous if it
is difficult to compute the asymptotic relative efficiency of an estimator, as is
true for the estimators corresponding to the Wilcoxon signed-rank test and
the two-sample rank-sum test.
In some situations there are estimators which can serve directly as test
statistics at all levels. An example of this was given at the end of Sect. 3.1.
In such situations, the argument of this subsection is unnecessary, and
calculating the asymptotic relative efficiency for the estimators directly is
virtually the same as calculating it for the tests.

*3.3 Estimators of Different Quantities

Now let us consider two arbitrary estimators T_{1,n} and T_{2,n} for (presumably) different quantities. There is no natural way to make a direct comparison between T_{1,n} as an estimator of one quantity and T_{2,n} as an estimator of another. If T_{1,n} has smaller variance than T_{2,n}, this may only be because the quantity estimated by T_{1,n} is the easier one to estimate. In fact, it could happen at the same time that some function of T_{2,n} estimates the same quantity as T_{1,n} and has smaller variance than T_{1,n}. Then T_{2,n} would certainly be more useful than T_{1,n} even though it has larger variance. Two estimators can be compared in a straightforward way only when both are estimating the same quantity.
We could leave the problem here, since there is no really compelling reason to compare two estimators of different quantities. However, an instructive comparison turns out to be possible between any two statistics T_{1,n} and T_{2,n}, which need not even be estimates at all. We shall find that there is a very natural way to use functions of them to estimate the same quantity asymptotically, and that the asymptotic relative efficiency of the
resulting estimators is the same as the asymptotic relative efficiency of T_{1,n} and T_{2,n} when used for testing in the same situation.
As in earlier sections, consider a family of distributions depending on a one-dimensional parameter θ and suppose that the statistics T_{1,n} and T_{2,n} are approximately normal with means μ_{1,n}(θ) and μ_{2,n}(θ) and variances σ²_{1,n}(θ) and σ²_{2,n}(θ). How can we obtain estimates of θ from T_{1,n} and T_{2,n}? Since T_{1,n} is approximately normal with mean μ_{1,n}(θ), it is a natural estimator of μ_{1,n}(θ). Provided μ⁻¹_{1,n}, the inverse function (not the reciprocal) of μ_{1,n}, is well defined, it is natural to estimate θ correspondingly by

    θ̂_{1,n} = μ⁻¹_{1,n}(T_{1,n}).                                (3.4)

Equivalently, θ̂_{1,n} is the solution of

    μ_{1,n}(θ̂_{1,n}) = T_{1,n}.                                  (3.5)

Define θ̂_{2,n} in terms of T_{2,n} similarly.
We now find the asymptotic distribution of θ̂_{1,n} using the general procedure known as the δ method. Expanding the right-hand side of (3.4) in a Taylor's series about μ_{1,n}(θ), or the left-hand side of (3.5) about θ, gives (Problem 13)

    θ̂_{1,n} − θ = [T_{1,n} − μ_{1,n}(θ)] / μ'_{1,n}(θ) + remainder.   (3.6)

Since T_{1,n} is approximately normal with mean μ_{1,n}(θ) and variance σ²_{1,n}(θ), it follows that θ̂_{1,n} is approximately normal with mean θ and variance

    σ²_{1,n}(θ)/[μ'_{1,n}(θ)]² = 1/e_{1,n}(θ).                    (3.7)

Similarly θ̂_{2,n} = μ⁻¹_{2,n}(T_{2,n}) is approximately normal with mean θ and variance 1/e_{2,n}(θ).
Therefore, in large samples, there is a natural way to use T_{1,n} and T_{2,n} for purposes of estimating θ, and the resulting estimators are approximately normal with the same mean θ and variances 1/e_{1,n}(θ) and 1/e_{2,n}(θ). The asymptotic efficiency of the estimator of θ based on T_{1,n} relative to that based on T_{2,n} is then

    E_{1:2}(θ) = lim e_{1,n}(θ)/e_{2,n}(θ) = lim {[μ'_{1,n}(θ)]²/σ²_{1,n}(θ)} / {[μ'_{2,n}(θ)]²/σ²_{2,n}(θ)},   (3.8)

which is the same as the asymptotic efficiency of T_{1,n} relative to T_{2,n} for testing hypotheses about θ.
Thus E_{1:2}(θ) gives an assessment of the value of T_{1,n} compared to T_{2,n} asymptotically, both for estimation of θ and for testing hypotheses about θ. Note that this entire analysis depends on the assumption that the true distribution belongs to the family indexed by θ. The meaning of θ does not extend in any automatic way beyond this particular family. If it is extended
in some way, then the tests and estimators for θ based on T_{1,n} and T_{2,n} will generally depend on the distribution assumption, because μ_{1,n} and μ_{2,n} do (Problem 14).
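As a concrete illustration of the δ method in this setting (our own, not from the text; NumPy and SciPy are assumed, and the population and sample size are arbitrary choices), take T_{1,n} to be the sample proportion negative from a normal(θ, 1) population, so that μ_{1,n}(θ) = Φ(−θ) and θ̂_{1,n} = −Φ⁻¹(T_{1,n}). The simulated variance of θ̂_{1,n} is then compared with the δ-method approximation 1/e_{1,n}(θ) of (3.7).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
theta, n, reps = 0.5, 400, 20_000

t1 = (rng.normal(theta, 1.0, size=(reps, n)) < 0).mean(axis=1)   # T_{1,n}: sample proportion negative
theta_hat = -norm.ppf(t1)                                         # solves Phi(-theta_hat) = T_{1,n}

p = norm.cdf(-theta)
var_delta = p * (1 - p) / (n * norm.pdf(theta) ** 2)              # 1/e_{1,n}(theta), as in (3.7)

print(round(theta_hat.var(), 5), round(var_delta, 5))             # the two values should be close
```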

4 Asymptotic Behavior of Confidence Bounds


We now study the behavior of typical confidence bounds in large samples, particularly their behavior relative to one another. Because of the relation between confidence bounds and tests of hypotheses, we can expect the present investigation to be closely related to that of Sect. 2, and the asymptotic relative efficiency of confidence procedures to be the same as that of the corresponding tests. As before, approximate statements should become exact as the sample sizes approach infinity, but we shall not give a precise formulation of the limit statements here.
Let us consider first an upper confidence bound Tₙ for a quantity μ at confidence level 1 − α. Suppose, as usual, that the true distribution belongs to some specified family of distributions indexed by a one-dimensional parameter θ. Since θ determines the true distribution, it also determines the quantity μ, which we may therefore write as μ(θ).
Corresponding to the upper confidence bound Tₙ is a one-tailed test for each μ which rejects the value μ if Tₙ < μ. Note that this is not a single test, but a family of tests, one for each value of μ. The probability under θ that Tₙ < μ(θ₀) is the power of the test for the null hypothesis μ(θ₀) against the alternative given by θ. By Equation (2.11), an approximation to this probability is

    P_θ[Tₙ < μ(θ₀)] ≅ Φ[±(θ − θ₀)√eₙ(θ₀) − z_α],                 (4.1)

where eₙ(θ₀) is the efficacy of the test at θ₀ and the choice of sign in the argument of Φ depends on whether the test is appropriate against alternatives θ < θ₀ or against alternatives θ > θ₀. Substituting θ₀ = θ + δ/|μ'(θ)| in (4.1) leads (Problem 15a) to the approximation

    P_θ[Tₙ < μ(θ) + δ] ≅ Φ[(δ/|μ'(θ)|)√eₙ(θ) − z_α].             (4.2)

Thus we have obtained an approximation to the c.d.f. of Tₙ when the true distribution is given by θ.
The efficacy eₙ(θ) need not be computed from the confidence bound Tₙ, as it can be computed from any test statistic which yields the corresponding test of the null value μ(θ). Tₙ is one such statistic, but is often not the easiest one to use in computing the efficacy, even in the case α = 0.50 discussed in Sect. 3.2. Indeed, the natural test statistic for the null value μ usually depends on μ. Consider, for example, the confidence bounds related to the Wilcoxon signed-rank test as in Sect. 4 of Chap. 3. It would be difficult, if not impossible,
to compute the mean and variance of this confidence bound directly. However, the corresponding test for any null value of μ can be carried out by subtracting μ from every observation and then computing the signed-rank sum. Note that this test statistic is a function of μ. Its efficacy is easy to compute for any given μ, and is also the efficacy of the confidence bound.
Now suppose we have two upper confidence bounds T_{1,n} and T_{2,n} for the same quantity μ at the same confidence level 1 − α. The c.d.f. of each is given approximately by (4.2). In addition, as n → ∞, e_{1,n}(θ)/e_{2,n}(θ) → E_{1:2}(θ), the asymptotic relative efficiency of the corresponding tests. It follows (Problem 15b) that for a given sample size, the c.d.f. of T_{1,n} at μ(θ) + δ is approximately the c.d.f. of T_{2,n} at μ(θ) + δ√E_{1:2}(θ); that is, T_{1,n} − μ(θ) and T_{2,n} − μ(θ) have approximately the same c.d.f. except for a scale factor 1/√E_{1:2}(θ). (Compare Fig. 2.1.) In particular, the expectation of T_{1,n} − μ(θ) is approximately 1/√E_{1:2}(θ) times the expectation of T_{2,n} − μ(θ). The quantity T_{1,n} − μ(θ) with its algebraic sign is the amount of overestimation. One might instead be interested in its positive part,

    [T_{1,n} − μ(θ)]⁺ = max{T_{1,n} − μ(θ), 0},

since it is desirable that an upper confidence bound for μ be small as long as it exceeds μ but not when it is smaller than μ. Just as for the signed overestimation, the expectation of [T_{1,n} − μ(θ)]⁺ is approximately 1/√E_{1:2}(θ) times the expectation of [T_{2,n} − μ(θ)]⁺. For other measures of "error" or "loss," √E_{1:2}(θ) is the scale factor in the random variable but not always in the expected loss (Problem 16).
This provides one interpretation of the asymptotic relative efficiency of tests in connection with the corresponding confidence procedures. We obtain another interpretation by considering the two confidence bounds T_{1,n₁} and T_{2,n₂}, based on samples of sizes n₁ and n₂ respectively. By the same argument as in earlier sections (Problem 17), if the ratio n₂/n₁ is, in the limit, equal to E_{1:2}(θ), then the confidence bounds T_{1,n₁} and T_{2,n₂} will, in the limit, have the same distribution and hence the same expected loss for any loss function.
These conclusions of course apply to lower as well as upper confidence bounds. As regards two confidence intervals, the previous discussion provides a comparison between the distributions of the two upper bounds, and similarly for the two lower bounds. A full comparison would also require consideration of the joint distribution of the upper and lower bounds, which we have not discussed. What we already know implies, however, that the expected length of the first interval is approximately 1/√E_{1:2}(θ) times that of the second (Problem 18). Furthermore, the length of a typical confidence interval is asymptotically constant, with a standard deviation of smaller order than its mean. Hence, in comparing two confidence intervals, the scale factor √E_{1:2}(θ) applies asymptotically to both their lengths and the deviation of their endpoints from μ(θ), and therefore to all other aspects of
their relationships to μ(θ). Thus E_{1:2}(θ) has the same kind of interpretation for confidence intervals as for confidence bounds. A different kind of argument would be required, however, to establish this for confidence intervals whose length is not asymptotically constant.
In typical situations, then, the quantity E_{1:2}(θ) will have the interpretations given above and will therefore be called the asymptotic efficiency of the confidence procedure T_{1,n} relative to T_{2,n}. It is the same as the asymptotic relative efficiency of the corresponding tests of the null value μ(θ), which is usually easier to compute directly.
As an example, consider a confidence bound for the population median based on an order statistic as in Sect. 4 of Chap. 2, and the normal-theory confidence bound X̄ + z_α S/√n for the population median, where X̄ and S are the sample mean and standard deviation and z_α is an appropriate constant. Both are confidence bounds for the same quantity if the population mean and median are assumed equal. (The confidence level of the normal-theory procedure depends on the population shape, but will be correct asymptotically.) Let μ be the common value of the population mean and median. The confidence bound based on the order statistic corresponds to the sign test for each μ, which is based on the number of observations less than μ. The normal-theory confidence bound corresponds to the t test for each μ. Under the normal assumption (the population is normal with mean θ = μ and standard deviation σ), the asymptotic efficiency of the first test with respect to the second was found in Sect. 2.3 to be 2/π = 0.64 at μ = θ = 0, and the same value clearly applies at other values of μ. For the two confidence bounds, this asymptotic efficiency means that under normality the order statistic bound has approximately the same probability of falling below μ + δ as the normal-theory bound has of falling below μ + δ√0.64 = μ + 0.80δ. In other words, the two confidence bounds differ from μ by amounts having the same distribution except for a scale factor 1/√0.64 = 1.25. In particular, the expected amount by which the order statistic bound exceeds μ is approximately 1.25 times the corresponding expectation for the normal-theory bound, and the same holds for the expected lengths of confidence intervals. Furthermore, the order statistic bound has approximately the same distribution as the normal-theory bound based on 64% as many observations.
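The 1.25 scale factor can be checked by simulation. The sketch below is our own illustration (the sample size and replication count are arbitrary, and NumPy and SciPy are assumed); it compares the expected overshoot of the order-statistic upper confidence bound for the median with that of the normal-theory bound, using the same attained confidence level for both so that the comparison is fair.

```python
import numpy as np
from scipy.stats import norm, binom

rng = np.random.default_rng(3)
n, alpha, reps = 100, 0.05, 20_000
mu, sigma = 0.0, 1.0

# Smallest k for which X_(k) is an upper confidence bound for the median at level >= 1 - alpha
k = int(binom.ppf(1 - alpha, n, 0.5)) + 1
attained = binom.cdf(k - 1, n, 0.5)         # attained confidence level of the order-statistic bound
z = norm.ppf(attained)                       # use the same level for the normal-theory bound

x = rng.normal(mu, sigma, size=(reps, n))
order_bound = np.sort(x, axis=1)[:, k - 1]                               # X_(k)
normal_bound = x.mean(axis=1) + z * x.std(axis=1, ddof=1) / np.sqrt(n)   # Xbar + z S / sqrt(n)

ratio = (order_bound.mean() - mu) / (normal_bound.mean() - mu)
print(round(ratio, 2))    # close to 1/sqrt(0.64) = 1.25 in large samples
```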

5 Example
We have seen that the asymptotic relative efficiency of two tests applies
also to the corresponding estimators and confidence bounds, with at least
two interpretations in each case. Accordingly, any numerical efficiency value
has many meanings. We now illustrate this whole range of ideas for two
sets of related procedures in the one-sample problem.

Consider the sign test for the null hypothesis that the population median
has a specified value. The natural test statistic is the number of observations
falling below the specified value. We thus have a whole family of tests (and
test statistics), one for each value which might be specified. The estimator
of the population median corresponding to this family of tests is the sample
median, as explained in Sect. 3.2. The corresponding confidence bounds are
order statistics, as explained in Sect. 4 of Chap. 2. For convenience, all these
procedures will be referred to in this section as "median procedures;" they
all permit inferences about the population median.
Consider also the family of t tests for null hypotheses specifying the
population mean. The corresponding estimator of the population mean is
the sample mean, as noted in Sect. 3.2, and the corresponding confidence
bounds are the usual ones based on the t distribution. All these procedures
will be referred to here as normal-theory procedures, because they are
exact under the assumption of normality. They give asymptotically valid
inferences about the population mean regardless of the population shape
provided that the variance is finite. (If this were not true, comparisons with
other procedures under assumptions other than normality would be com-
plicated by the discrepancy between their true level and their nominal level.
Such a discrepancy would invalidate all our earlier analysis, and would also
bring a new consideration into the problem-the trade-off between level
and power. See also the discussion of power comparisons using conservative
tests in Sect. 4.3 of Chap. 1.)
Our object here is to compare the median procedures with the normal-
theory procedures under the assumption that the population median and
mean are equal. Before proceeding, however, let us emphasize that these
procedures lead in general to inferences about different quantities. A median
procedure provides an inference about the population median, while a
normal-theory procedure provides an inference about the population mean.
Accordingly, the first question to ask is whether it is really the median or the
mean of the population which is of interest. Careful thought may reveal that
one or the other (or something else entirely) is really the parameter of in-
terest. If so, and if we are also unwilling to assume that they differ neg-
ligibly, then our choice of procedure will be clear and an efficiency compari-
son irrelevant. Such considerations are at least as important as efficiency.
They receive less attention here simply because they require less explanation.
On the other hand, if we believe that the population median and mean
differ negligibly compared to the uncertainty resulting from sampling
variability, then the relative efficiency of the median and normal-theory
procedures will be of interest. It will also be of interest in situations where
we think the population median and mean may well differ appreciably and
it is immaterial which one the inference concerns-but then perhaps a more
meaningful and useful way of making or scaling the measurements could
be found. In either case, the choice among procedures may be facilitated by
learning something about their relative efficiency. We shall, however,

compute and interpret their asymptotic relative efficiency only in situations


where the population median and mean have the same value, say μ. In fact, we
shall consider populations which are symmetric about μ. It is difficult to
justify the assumption that the population mean and median are equal
except when the stronger symmetry assumption can also be justified.
For the median procedures, the estimator is the sample median, say
T_{1,n}. For large n, this statistic is (Problem 19) approximately normal with
mean equal to the population median μ and variance 1/(4nd²), where d is the
population density at μ (provided the population has a positive, continuous
density at μ). This result will be used below to obtain the efficacy of the
median procedures. Of course, T_{1,n} is not in itself a test statistic or confidence
bound for μ (except at the level α = 0.50), but we have already seen that
efficiencies computed from estimators apply also to the corresponding tests
and confidence bounds. Instead of the sample median we could have used
the natural test statistic for each null value μ, namely the number of ob-
servations less than μ. The reader may verify (Problem 9b) that in what
follows this would give the same efficacies to first order as n → ∞. As it
happens, the efficacies would be exactly the same if computed in the natural
way.
For the normal-theory procedures, the estimator is the sample mean,
say T_{2,n}. For large n, this statistic is approximately normal with mean μ
and variance σ²/n, where μ and σ² are the population mean and variance.
This fact will be used to obtain the efficacy of the normal-theory procedures.
The t statistic for each value of μ gives the same efficacies to first order as
n → ∞.
In order to put the problem in the framework of the discussion of Sects.
2-4, we must restrict consideration to a family of distributions indexed by a
single parameter θ. A number of such families are considered below. In
each family, the parameter θ is the same as the center of symmetry μ, so that
μ(θ) = θ.
Suppose first that the population distribution has a Laplace (double
exponential) density

    f(x; θ) = (1/2λ) e^{−|x−θ|/λ},   (5.1)

where λ is an arbitrary but fixed positive number. This distribution is sym-
metric about θ, with mean θ and variance 2λ² (Problem 20a). For the sample
median T_{1,n} and the sample mean T_{2,n}, we find (Problem 20b)

    μ_{1,n}(θ) = θ,  σ²_{1,n}(θ) = λ²/n,   e_{1,n}(θ) = n/λ²,      (5.2)
    μ_{2,n}(θ) = θ,  σ²_{2,n}(θ) = 2λ²/n,  e_{2,n}(θ) = n/(2λ²).   (5.3)
For the Laplace family of distributions, (5.1), the asymptotic efficiency of the
median procedures relative to the normal-theory procedures is therefore

    E_{1:2}(θ) = e_{1,n}(θ)/e_{2,n}(θ) = 2.   (5.4)

This result does not depend on θ, as could be anticipated from the nature
of the procedures and the way θ enters the density (5.1) (Problem 21d). Ac-
cording to Sects. 2-4, the implications of this result under the model (5.1) are
as follows.
Consider the median procedure and the normal-theory procedure for
testing the hypothesis θ = θ₀, that is, the sign test and the t test for this null
hypothesis. We assume always that the tests have the same level α and are
either both one-tailed in the same direction or both two-tailed with the
same division of the significance level between the two tails. If the tests are
based on samples of the same (large) size, then the power of the sign test at a
point δ units away from θ₀ is approximately the same as the power of the t
test at a point δ√2 = 1.41δ units away in the same direction; that is, the
sign test gives approximately the same power at any point as the t test
gives at a point farther from θ₀ by the factor √2. In terms of different sample
sizes (still large), we may say that the sign test requires approximately one-
half as many observations as the t test to give a specified power at a specified
alternative near the null hypothesis. Both these statements apply to the Laplace
alternative (5.1), of course.
Approximations to the power itself can be given in terms of e_{1,n}(θ) and
e_{2,n}(θ); by Equation (2.5) the respective powers against θ of the sign test
and the t test are approximately Φ[√n(θ − θ₀)/λ − z_α] and Φ[√(n/2)(θ − θ₀)/λ − z_α]
for one-tailed tests appropriate against the alternative θ > θ₀.
Similar expressions could be written for one-tailed tests appropriate against
θ < θ₀ and for two-tailed tests. One might expect these approximations to
be better than usual here, because when the μ_{i,n}(θ) are linear in θ and the
σ²_{i,n}(θ) do not depend on θ, as here, (2.4) is exact and (2.5) and (2.7) agree
exactly with (2.3) for both the mean and the median. This is misleading,
however, as the mean and median can serve as test statistics only at the level
α = 0.50. At other levels, other test statistics would be required, and, for
them, (2.5) and (2.7) would not agree exactly with (2.3).
To estimate μ, the median and normal-theory procedures use the sample
median and mean respectively. In large samples from the Laplace distribu-
tion (5.1), both are approximately normal with mean μ = θ, and the variance
of the median is approximately one-half of the variance of the mean from
a sample of the same size. The estimation error of the median has approx-
imately the same distribution as 1/√2 times the estimation error of the mean.
If the sample size for the median is one-half of the sample size for the mean,
their distributions will be approximately the same. The situation is par-
ticularly simple in that the factor one-half does not depend on θ. (This
simplification occurs frequently, but not always.)
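As an illustration, here is a small simulation sketch (not part of the text; it assumes numpy is available) checking the factor one-half for the variances of the two estimators under the Laplace distribution (5.1).

    import numpy as np

    rng = np.random.default_rng(1)
    n, lam, theta, reps = 200, 1.0, 0.0, 20000
    medians = np.empty(reps)
    means = np.empty(reps)
    for i in range(reps):
        x = theta + rng.laplace(scale=lam, size=n)
        medians[i] = np.median(x)
        means[i] = x.mean()

    print(medians.var() / means.var())    # close to 1/2 for large n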
For large samples from the Laplace distribution (5.1), an upper confidence
bound for μ computed by the median procedure has approximately the
same probability of falling below μ + δ as the normal-theory bound has of
falling below μ + δ√2. The amount by which the former exceeds μ has

approximately the same distribution as 1/√2 times the amount by which the
latter exceeds μ. Similar statements hold for a lower confidence bound.
The expectation of the difference between a confidence bound and μ, or of
the positive part of this difference, or of the length of a confidence interval,
is approximately 1/√2 times as great for the median procedure as for the
normal-theory procedure. The median procedure using a sample size one-
half as large as the normal-theory procedure gives confidence bounds having
approximately the same distribution. Again, the implications of these
statements are particularly simple because the factor one-half does not
depend on θ.
All these statements are implied by the single statement that the asymp-
totic efficiency of the median relative to the mean is 2 (for all θ). Note that
these results apply specifically when the true distribution is Laplace, (5.1),
and its center of symmetry θ is the parameter of interest. For a true distribu-
tion with a different shape, the asymptotic relative efficiency will generally be
different, as we shall illustrate next.
Suppose now that the population is normal with mean θ and variance σ²,
where σ² is arbitrary but fixed, like λ above. In this case the asymptotic
efficiency of the median procedures relative to the normal-theory procedures
is (Problems 7c and 21)

    E_{1:2}(θ) = 2/π = 0.64.   (5.5)
In large samples, then, the sign test of a null hypothesis θ = θ₀ has ap-
proximately the same power at any point as the t test at a point √0.64 = 0.80
times as far from θ₀. The estimation error of the median has approximately the
same distribution as 1/√0.64 = 1.25 times the estimation error of the mean.
A confidence interval for θ computed by the median procedure differs from
θ by an amount having approximately the same distribution as 1/√0.64 times
the corresponding difference for the normal-theory procedure. All these
statements apply to samples of the same size. The median procedures for
testing, estimation, and setting confidence limits behave approximately like
the normal-theory procedures based on a sample 0.64 times as large. Again
the factor is independent of θ.
If the population is normal, the median procedures are asymptotically
0.64 times as efficient as the normal-theory procedures. On the other hand,
if the population is Laplace, we saw earlier that the median procedures are
asymptotically twice as efficient as the normal-theory procedures. It is
convenient to consider both the normal and Laplace densities as special
cases of the general density (which might be called the "double exponential
power" or "power Laplace" density) given by

    f(x; θ) = [k/(2λΓ(1/k))] e^{−|x−θ|^k/λ^k},   (5.6)

where Γ denotes the well-known gamma function and k and λ are positive.
For any k, the density (5.6) has center of symmetry θ. It is Laplace when

k = 1 and normal when k = 2. As k → ∞, it approaches the uniform density
on the interval (θ − λ, θ + λ) (Problem 22). When the density is of the form
(5.6) and θ is the parameter of interest, with k and λ arbitrary but fixed, the
asymptotic efficiency of the sample median relative to the sample mean is
(Problem 23a) independent of θ and λ, and is given by

    E_{1:2}(θ; k) = k²Γ(3/k)/[Γ(1/k)]³.   (5.7)

Figure 5.1 is a graph of (5.7) as a function of k. As k → ∞ the limit of (5.7)
(Problem 23b) is

    lim_{k→∞} E_{1:2}(θ; k) = 1/3,   (5.8)

which agrees with an independent calculation of the asymptotic relative
efficiency for the uniform distribution (Problem 23c). The limit as k → 0
is ∞, but there is no limiting distribution at k = 0. Each value of E_{1:2}(θ; k)
has multiple interpretations, as illustrated before for k = 1 and k = 2.
Notice that, while the normal-theory procedure can be only three times as
efficient as the median procedure for a family of distributions of the form (5.6),
the median procedure may be arbitrarily many times as efficient as the normal-
theory procedure. (See also Sect. 8.3.)
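A quick numerical check of (5.7) (not in the text; it assumes scipy for the gamma function) confirms the value 2 at k = 1, the value 2/π = 0.64 at k = 2, and the approach to the limit 1/3 of (5.8) as k grows.

    from scipy.special import gamma

    def are_median_vs_mean(k):
        # Formula (5.7): asymptotic efficiency of the sample median relative to the
        # sample mean for the double exponential power family (5.6).
        return k**2 * gamma(3.0 / k) / gamma(1.0 / k)**3

    for k in (0.5, 1.0, 2.0, 4.0, 100.0):
        print(k, are_median_vs_mean(k))
    # e.g. 2.0 at k = 1 and 0.6366 at k = 2, tending toward 1/3 for large k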
We will return to these and other tests and families of probability dis-
tributions in Sect. 8, after giving a formal definition of asymptotic relative
efficiency in the next section and a formula for its calculation in Sect. 7.

[Figure 5.1: graph of E_{1:2}(θ; k) in (5.7) as a function of k, for k from 0 to 10]

Those readers who are satisfied with the informal definitions of asymptotic
relative efficiency given in Sects. 2-4 may skip the precise formulation pre-
sented in Sect. 6 and go immediately to Sect. 7.

*6 Definitions of Asymptotic Relative Efficiency

6.1 Introduction

In Sect. 1 we defined the relative efficiency of two procedures which are


applicable in the same situation as the ratio of the sample sizes needed to
achieve the same performance, adding the qualifications necessary to make
this a precise definition. We then defined the asymptotic relative efficiency
of two procedures as the limit of their relative efficiency as the sample sizes
approach infinity, but did not make this definition precise. Instead, in Sects.
2-4 we investigated heuristically and approximately how procedures behave
and how they may be compared as sample sizes become large. The investiga-
tion revealed that, rather than there being a separate numerical value of the
asymptotic relative efficiency for each performance criterion, a great sim-
plification results in large samples, and one number does, in a sense, describe
the entire situation asymptotically. This permits much stronger definitions
of asymptotic relative efficiency than would otherwise be possible.
In this section, we shall give precise statements of a number of asymptotic
properties suggested by the investigation of Sects. 2-4. The statements in
this book about asymptotic relative efficiency apply to most of these pro-
perties, although we shall not prove this rigorously. In the statistical litera-
ture, the standard definition of asymptotic relative efficiency and most
statements and proofs about it refer only to tests and only to the property
involving the ratio of sample sizes (A(ii) below) or a slightly weaker version
of it. Suitably interpreted, the statements ordinarily apply also to the other
properties below, including those for estimators and confidence bounds, but
this cannot be taken for granted and is seldom proved.
There is also a smaller but significant and growing literature about a kind
of asymptotic relative efficiency based on a fundamentally different limiting
operation from the set of related properties discussed here. For a fixed
alternative, it concerns the rate at which one type of error probability
approaches 0 with the other fixed as n ~ 00, or the rate at which the maxi-
mum error probability can be made to approach O. It uses the probability
theory of "extreme deviations." Some early and other references are
Chernoff [1952J, [1956J, Hodges and Lehmann [1956], Blyth [1958],
Bahadur [1960], [1971], and Groeneboom and Oosterhoff [1977].
As already indicated in Sect. 1, asymptotic relative efficiency is a property
of two sequences of procedures, each defined for all n. As in Sects. 2-4, we

always consider a family of distributions indexed by a one-dimensional
parameter θ. The investigations of Sects. 2-4 show that there will typically
exist a quantity E_{1:2}(θ) with the properties described below for tests, estima-
tors, and confidence bounds. This quantity will be called the asymptotic
relative efficiency. We assume throughout that E_{1:2}(θ) is neither 0 nor ∞.
If E_{1:2}(θ) is 0 or ∞, some restatement of the properties is needed (Problem 25).

6.2 Tests

Suppose first that we are comparing two test procedures and that θ₀ gives a
distribution belonging to the null hypothesis for each test. If both tests are
one-tailed, assume that they are appropriate against the same one-sided
alternative and that the exact levels of the tests under θ₀ approach the same
positive constant as n → ∞. If both tests are two-tailed, make the same
assumption about the level in each tail separately. Then the asymptotic
efficiency E = E_{1:2}(θ₀) of the first test relative to the second can be expected
to have the following properties, any one of which could be taken as the defi-
nition of asymptotic relative efficiency.
A(i). For two tests based on the sample size n, the difference between
the power of the first test at θ₀ + δ and the power of the second test at
θ₀ + δ√E approaches zero uniformly in δ as n → ∞.

If P_{1,n} and P_{2,n} are the power functions of the two tests, the condition is

    P_{1,n}(θ₀ + δ) − P_{2,n}(θ₀ + δ√E) → 0 uniformly in δ as n → ∞.   (6.1)

The statement of (6.1) in ε terminology, directly from the definition of uniform
convergence, is that for every ε > 0, there is an N not depending on δ such
that, for all n > N and all δ,

    |P_{1,n}(θ₀ + δ) − P_{2,n}(θ₀ + δ√E)| < ε.   (6.2)

Uniform convergence is equivalent by definition to convergence of the
maximum absolute difference. Thus (6.1) is equivalent (Problem 24) to

    max_δ |P_{1,n}(θ₀ + δ) − P_{2,n}(θ₀ + δ√E)| → 0 as n → ∞.   (6.3)

This says that, if the two powers at θ₀ + δ and θ₀ + δ√E respectively are
graphed as functions of δ (Fig. 2.1, for instance), the maximum vertical dif-
ference between the graphs approaches zero as n → ∞. Uniform convergence
is also equivalent to convergence for every sequence δ_n, so (6.1) is also equival-
ent (Problem 24) to

    P_{1,n}(θ₀ + δ_n) − P_{2,n}(θ₀ + δ_n√E) → 0 as n → ∞   (6.4)

for all sequences δ_n. These equivalences indicate the significance of the uni-
formity in (6.1), to be discussed further shortly.
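As an informal illustration of A(i) (not from the text; it assumes numpy and scipy), the following Monte Carlo sketch compares the power of the sign test at θ₀ + δ with the power of the t test at θ₀ + δ√E for normal data, where E = 2/π; the two estimated powers should be approximately equal for large n.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n, theta0, sigma, alpha, delta, E = 100, 0.0, 1.0, 0.05, 0.25, 2 / np.pi

    def power_sign(shift, reps=5000):
        # reject if too many observations exceed theta0 (slightly conservative
        # because the binomial critical value is discrete)
        crit = stats.binom.ppf(1 - alpha, n, 0.5)
        hits = sum(np.sum(rng.normal(theta0 + shift, sigma, n) > theta0) > crit
                   for _ in range(reps))
        return hits / reps

    def power_t(shift, reps=5000):
        crit = stats.t.ppf(1 - alpha, n - 1)
        hits = 0
        for _ in range(reps):
            x = rng.normal(theta0 + shift, sigma, n)
            hits += (x.mean() - theta0) / (x.std(ddof=1) / np.sqrt(n)) > crit
        return hits / reps

    print(power_sign(delta), power_t(delta * np.sqrt(E)))   # approximately equal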

We saw in Sect. 2.1 how the power function of an arbitrary test can typically
be approximated by a normal distribution as in Equation (2.11). Using this
approximation, along with the fact that the efficacy is typically of order
n, we observe that the powers of the two tests at θ₀ + δ_n and θ₀ + δ_n√E
appearing in (6.1)-(6.4) typically behave as follows, for an arbitrary E.
If δ_n → 0 so fast that √n δ_n → 0, then neither test is effective asymptotically
and both powers approach the significance level α. If, at the other extreme,
√n δ_n → ±∞, then both powers approach one (or zero for one-tailed tests
in the wrong direction). If δ_n → 0 in such a way that √n δ_n → d for d nonzero
and finite, then both powers approach limits other than 1, α, or 0 in general.
These limits will be equal if E = E_{1:2}(θ₀), but otherwise will not, with one
minor exception. (For two-tailed tests with unequal tails in the limit, for
each E ≠ E_{1:2}(θ₀) there is one particular value of d such that the limits are
equal and less than α. See Problem 26.)
Notice that, without the condition "uniformly in δ," (6.1) would lose all
force, since both terms in (6.1) approach one (or zero for one-tailed tests in
the wrong direction) for every fixed δ ≠ 0 regardless of the value of E. But
with the uniformity, property A(i) implies that the power functions will be
the same in the limit except for a scale factor 1/√E even when the alternatives
are rescaled so that the limits are not degenerate, specifically, when the powers
are considered as functions of √n(θ − θ₀).
From the foregoing it also follows that the only value of E with property
A(i) is E = E_{1:2}(θ₀). In most of the properties to follow, uniformity is
essential to the meaning and can be restated in the style of (6.3) and (6.4);
similar comments about degenerate limits and rescaling apply; and similar
converses can be stated. This will not be mentioned each time, but left to the
reader to fill in (Problems 27-29).
The asymptotic relative efficiency property A(i) concerns the power of
the two tests at different alternatives for the same sample size. The next two
properties relate to the same alternative but different sample sizes.

A(ii). For two tests with sample sizes n₁ and n₂ respectively, the difference
between the powers at the same point θ approaches zero uniformly in θ
when n₁ and n₂ approach infinity simultaneously in such a way that
n₂/n₁ → E.
In the same notation as before, the condition here is

    P_{1,n₁}(θ) − P_{2,n₂}(θ) → 0 uniformly in θ if n₂/n₁ → E.   (6.5)
The final property of the asymptotic relative efficiency E of two tests
which we state is

A(iii). For two tests, if n₁ is the minimum n for which the first test has
power at least 1 − β against the alternative θ and n₂ is defined similarly
for the second test, then n₂/n₁ → E as θ → θ₀.

Here 1 − β is required to exceed the limiting significance level of the tests
and, for one-tailed tests, θ is restricted to the side of θ₀ against which the
tests are appropriate. Property A(iii) says that the ratio of the sample sizes
required to achieve a specified power against a specified alternative ap-
proaches E as the alternative approaches θ₀ and as, consequently, the sample
sizes approach infinity.

6.3 Estimators

Consider next two estimators T_{1,n} and T_{2,n} of the same quantity μ(θ). Since
there is no distinguished value of θ in this context, we return to the notation
E_{1:2}(θ) for the asymptotic relative efficiency, to emphasize its dependence
on θ. Continuing our delineation of the properties which we expect E_{1:2}(θ)
to have, we state four such properties, B(i)-B(iv), any one of which could
serve to define asymptotic relative efficiency for estimators.
B(i). When the true distribution is given by θ, the difference between the
probability of an error of δ or less using T_{1,n} and the probability of an
error of δ√E_{1:2}(θ) or less using T_{2,n} approaches zero uniformly in δ as
the common sample size n → ∞.
In symbols,

    P_θ[T_{1,n} − μ(θ) ≤ δ] − P_θ[T_{2,n} − μ(θ) ≤ δ√E_{1:2}(θ)] → 0   (6.6)

uniformly in δ as n → ∞.
This says that the errors of the two estimators have, in an appropriate
sense, asymptotically the same distribution except for a scale factor
1/√E_{1:2}(θ) (Problem 30). The errors themselves have distributions which
concentrate at zero as n → ∞, so that both terms in (6.6) approach zero for
δ < 0 and both approach one for δ > 0. By uniformity, however, Property
B(i) implies that even when the errors are scaled up in such a way that their
distributions are not degenerate in the limit, the distributions will be the
same in the limit except for the scale factor 1/√E_{1:2}(θ). Specifically, when the
errors are scaled up by the factor √n, their distributions typically converge to
normal distributions with mean 0 and variances in the ratio 1/E_{1:2}(θ);
equivalently, the distributions of √n times the error of T_{1,n} and √(n/E_{1:2}(θ))
times the error of T_{2,n} approach the same normal distribution, with mean
zero and positive variance.
Scaling, the importance of uniformity, alternative statements of it, and
the uniqueness of E are mostly much the same as in Sect. 6.2 and will not be
discussed further here, but they should be borne in mind.
B(ii). When the true distribution is given by θ, the ratio of the variance of
T_{1,n} to the variance of T_{2,n} approaches 1/E_{1:2}(θ) as n → ∞. The same is
true of the ratio of their mean squared errors. The ratio of the standard

deviation of TI,n to that of T2,n approaches l/JE1: 2(0). The same holds for
the ratio of their mean absolute errors.
Property B(ii) does not follow automatically from B(i) because the variance
of a limiting distribution, though ordinarily the limit of the variances for
finite n, need not be. For instance, √n[T_{1,n} − μ(θ)] can have infinite variance
for every n even though its limiting distribution has finite variance (Problem
31). Common methods of deriving limiting distributions do not apply to the
limit of the variances. Similar remarks hold for mean squared error, standard
deviation, and mean absolute error. Accordingly, statements and proofs about
asymptotic relative efficiency, in this book and elsewhere, usually apply
directly to B(i), but for B(ii), some statements need qualification, especially
very general ones, and most proofs need additional justification.
B(iii). For two estimators with sample sizes n₁ and n₂ respectively, if the
true distribution corresponds to θ, and n₁ and n₂ approach infinity simul-
taneously in such a way that n₂/n₁ → E_{1:2}(θ), then the difference between
the c.d.f.'s of the two estimators converges uniformly to zero.
In symbols,

    P_θ(T_{1,n₁} ≤ t) − P_θ(T_{2,n₂} ≤ t) → 0   (6.7)

uniformly in t if n₂/n₁ → E_{1:2}(θ).
B(iv). For two estimators with sample sizes n₁ and n₂ respectively, if the
true distribution corresponds to θ and n₁ and n₂ approach infinity simul-
taneously in such a way that n₂/n₁ → E_{1:2}(θ), then the ratio of the variances
of the estimators approaches one, as does the ratio of their mean squared
errors, the ratio of their standard deviations, the ratio of their mean
absolute errors, and indeed the ratio of the expectation of any function of
their errors.
Comments similar to those following property B(ii) apply to B(iii) and
B(iv).
The final phrase of B(iv) implies that no matter what function of the error
(and even of the true value as well) is chosen to represent the loss of mis-
estimation, the ratio of the expected losses still approaches one if n₁ and n₂
approach infinity in such a way that n₂/n₁ → E_{1:2}(θ). The conditions needed
to prove that this property holds depend of course on the loss function, and
may be strong.

6.4 Confidence Bounds

We consider next two confidence bounds T_{1,n} and T_{2,n} for the same quantity
μ(θ) and state four properties, which we may expect the asymptotic relative
efficiency E_{1:2}(θ) to have and could use to define asymptotic relative efficiency
for confidence bounds.

C(i). When the true distribution is given by θ, the difference between the
c.d.f. of T_{1,n} at μ(θ) + δ and the c.d.f. of T_{2,n} at μ(θ) + δ√E_{1:2}(θ) ap-
proaches zero uniformly in δ as the common sample size n → ∞.
In symbols,

    P_θ[T_{1,n} ≤ μ(θ) + δ] − P_θ[T_{2,n} ≤ μ(θ) + δ√E_{1:2}(θ)] → 0   (6.8)

uniformly in δ as n → ∞.
This implies that, in a relative sense, the two confidence bounds have
asymptotically the same distribution except for a scale factor 1/√E_{1:2}(θ). The
discussions of scaling, uniformity, and uniqueness of E in Sects. 6.2 and 6.3
apply here with little change and will not be repeated.
C(ii). When the true distribution is given by θ, the ratio of the expectation
of T_{1,n} − μ(θ) to the expectation of T_{2,n} − μ(θ) approaches 1/√E_{1:2}(θ) as
n → ∞. The ratio of the expected overestimations, that is, of the expectations
of [T_{1,n} − μ(θ)]⁺ and [T_{2,n} − μ(θ)]⁺, and the ratio of the expected under-
estimations [μ(θ) − T_{1,n}]⁺ and [μ(θ) − T_{2,n}]⁺, also approach 1/√E_{1:2}(θ)
as n → ∞.
As with property B(ii), C(ii) does not strictly follow from C(i), and state-
ments about asymptotic relative efficiency often need additional justification
and may need qualification to apply to property C(ii).
The definition of [T_{i,n} − μ(θ)]⁺ and the reason for considering it when
T_{i,n} is an upper confidence bound were given in Section 4. The reason for
considering [μ(θ) − T_{i,n}]⁺ when T_{i,n} is a lower confidence bound is
analogous.
C(iii). For two confidence bounds with sample sizes n₁ and n₂ respectively,
if the true distribution is given by θ and if n₁ and n₂ approach infinity
simultaneously in such a way that n₂/n₁ → E_{1:2}(θ), then the difference
between the c.d.f.'s of the two confidence bounds converges uniformly to
zero.
In symbols,

    P_θ[T_{1,n₁} ≤ t] − P_θ[T_{2,n₂} ≤ t] → 0   (6.9)

uniformly in t if n₂/n₁ → E_{1:2}(θ).

C(iv). For two confidence bounds for μ with sample sizes n₁ and n₂
respectively, if the true distribution is given by θ, and if n₁ and n₂ approach
infinity simultaneously in such a way that n₂/n₁ → E_{1:2}(θ), then the ratio
of their expected overestimations approaches one, as does the ratio of the
expected values of any function of their differences from μ.

Comments similar to those following B(ii), B(iv), and C(ii) apply to C(iii)
and C(iv).

6.5 Confidence Intervals

The final properties we introduce apply to two-sided confidence intervals.


D(i). For two confidence interval procedures with the same n, the upper
and lower endpoints each have property C(i), and the ratio of the length of
the first confidence interval to the length of the second converges in prob-
ability to 1/√E_{1:2}(θ) as n → ∞ when the true distribution is given by θ.
If L_{1,n} and L_{2,n} are the respective lengths, the convergence in probability
means, by definition, that for every ε > 0,

    P_θ[|L_{1,n}/L_{2,n} − 1/√E_{1:2}(θ)| > ε] → 0 as n → ∞.   (6.10)

Thus, in addition to property C(i) for the endpoints, property D(i) requires
that the ratio of the lengths of the intervals has arbitrarily high probability
of being arbitrarily near 1/√E_{1:2}(θ) if n is large enough.
D(iii). For two confidence interval procedures with sample sizes n₁ and n₂
respectively, the upper and lower endpoints each have property C(iii), and
the ratio of the length of the first confidence interval to the length of the
second converges in probability to one if n₁ and n₂ approach infinity
simultaneously in such a way that n₂/n₁ → E_{1:2}(θ) when the true distribu-
tion is given by θ.
It is natural to define properties D(ii) and D(iv) in a similar way, re-
quiring the endpoints to have properties C(ii) and C(iv) but requiring con-
vergence in the ordinary sense of the ratio of the expectations of the lengths
(or any functions of the lengths) in place of convergence in probability of
the ratio of lengths. Statements about asymptotic relative efficiency often
need additional justification and may need qualification to apply to such
properties, because they involve expectations.

6.6 Summary

We have now outlined some precise properties of asymptotic relative effi-


ciency. If we state in this book that two tests have asymptotic relative
efficiency E_{1:2}(θ₀) under certain circumstances, then the two tests have the
properties A(i)-(iii) under these circumstances. In the same way, the proper-
ties B, with the qualifications noted, define asymptotic relative efficiency
for estimators, and the properties C and D for confidence procedures. The
heuristic discussion of Sects. 2-4 motivated these definitions and suggests
that they are satisfied in interesting situations. However, nothing has been
proved in this section about their validity in any situation.
The heuristic discussion also suggests that related tests, estimators, and
confidence procedures have the same asymptotic relative efficiency. As a

result, it is not really misleading to think of the quantity E_{1:2}(θ) as being the
same throughout. However, the definition of the asymptotic relative efficiency
of tests, for instance, is quite independent of the definition for estimators
and confidence procedures. Hence it is perfectly legitimate to talk about the
asymptotic relative efficiency of tests without referring to any estimators or
confidence procedures or to their relation to these tests. A rigorous definition
of what it means for tests, estimators, and confidence procedures to be
"related" has not been given here.
In the statistical literature, asymptotic relative efficiency is usually
defined for tests by property A(ii), or a slightly weaker version of it (Problem
32). The other properties for tests are included here because they enrich the
meaning of asymptotic relative efficiency and do generally hold, even though
they may not be mentioned or their validity proved. Occasionally, reference
is made to "Mood's definition" [Mood, 1954] of asymptotic relative effi-
ciency. This applies to one-tailed tests and to two-tailed tests with equal
tails. It ordinarily agrees with the usual definition in the case of two-tailed
tests with equal tails, but it gives the square root of the usual value in the
case of one-tailed tests (Problem 33). Definitions of the asymptotic relative
efficiency of point estimators and confidence procedures do not appear
frequently in the literature. The properties of these procedures which are
most similar to A(ii) for tests are B(iii), C(iii), and D(iii).

7 Pitman's Formula
The asymptotic relative efficiency of two procedures was defined in Sect. 1
as the limit of the relative efficiency, that is, the limit of the ratio of the sample
sizes needed to achieve the same performance. The heuristic discussion in
Sects. 2-4 indicated that a single number describes the relative efficiency
asymptotically for a wide range of measures of performance. This discussion
also led to a formula, that given in (2.17) and usually called Pitman's formula,
for computing asymptotic relative efficiency as the limiting ratio of two
efficacies. We used this formula for calculation in the examples of Sect. 5.
In light of the discussion, we could reasonably define asymptotic relative
efficiency as a quantity with the properties indicated in Sects. 2-4 and
laid out in Sect. 6, or some specified subset of them. Pitman's formula is
not a part of the definition of asymptotic relative efficiency, and numbers com-
puted from it have these properties only under suitable conditions (which,
for many of the properties, are very weak). However, the formula provides
a convenient and widely applicable method for computation of asymptotic
relative efficiencies, and we will use it further below. Nevertheless, we will not
give sufficient conditions for the validity of the formula, nor prove formally
that numbers computed from it have the properties given in Sect. 6.
For convenience and easier reference, Pitman's formula will now be
repeated. Consider a family of distributions indexed by a one-dimensional

parameter θ. Let T_n be a one-sample test statistic or point estimator and
suppose that when the true distribution is given by θ, T_n is asymptotically
normal with mean μ_n(θ) and variance σ_n²(θ), that is, [T_n − μ_n(θ)]/σ_n(θ) con-
verges in distribution to the standard normal distribution. Assuming μ_n is
differentiable, define the efficacy of T_n by

    e_n(θ) = [μ_n′(θ)]²/σ_n²(θ)   (7.1)

and the limiting efficacy per observation or asymptotic efficacy by

    e_• = lim_{n→∞} [e_n(θ)/n].   (7.2)

For finite n, μ_n(θ) and σ_n²(θ) need not be the actual mean and variance of T_n,
but may be any well-behaved functions of θ for which the asymptotic nor-
mality holds. This freedom of choice for μ_n and σ_n means that the efficacy
e_n(θ) is not uniquely defined, but its indeterminacy is of second order only.
Hence the asymptotic efficacy e_• is uniquely defined, because it depends on
only the dominant term of e_n.
Now let T_{1,n} and T_{2,n} be two test statistics or two point estimators with
asymptotic efficacies e_{1·} and e_{2·} respectively. The asymptotic efficiency
of T_{1,n} relative to T_{2,n} is the ratio e_{1·}/e_{2·}. However, it is not neces-
sary in practice to divide the respective efficacies e_{1,n} and e_{2,n} by n or to
compute the limits e_{1·} and e_{2·}. The ratio of the limits can usually be found
more easily as the limit of the ratio of the efficacies, e_{1,n}(θ)/e_{2,n}(θ). Accor-
dingly, we give Pitman's formula for a one-sample problem as follows.

    E_{1:2}(θ) = lim_{n→∞} e_{1,n}(θ)/e_{2,n}(θ)
             = lim_{n→∞} {[μ′_{1,n}(θ)]²/σ²_{1,n}(θ)} · {σ²_{2,n}(θ)/[μ′_{2,n}(θ)]²}.   (7.3)
For problems with more than one sample, little modification is required
beyond replacing n by the vector of sample sizes; for two samples, see Sect. 10.
Applying Pitman's formula to corresponding tests, point estimators, and
even confidence bounds gives the same asymptotic relative efficiency.
Whichever equivalent statistic is easiest may therefore be used. Most of the
asymptotic efficiency results which are reported in the statistical literature
are computed by Pitman's formula. Pitman's formula does not apply, how-
ever, to definitions of asymptotic relative efficiency based on different limiting
operations, such as those mentioned in Sect. 6.1.
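The following sketch (not from the text; it assumes scipy is available) applies Pitman's formula with the sample median and the sample mean as the statistics: μ_n(θ) = θ for both, while the asymptotic variances are 1/(4nh(θ)²) and σ²/n respectively, so the n's cancel in the ratio (7.3), leaving the efficacies per observation that reappear as (8.2) and (8.3) below.

    import numpy as np
    from scipy import stats
    from scipy.integrate import quad

    def pitman_are_median_vs_mean(pdf):
        """ARE of the sample median to the sample mean for a density symmetric about 0."""
        var = quad(lambda x: x**2 * pdf(x), -np.inf, np.inf)[0]   # population variance
        e_median = 4 * pdf(0.0)**2    # limiting efficacy per observation of the median
        e_mean = 1.0 / var            # limiting efficacy per observation of the mean
        return e_median / e_mean

    print(pitman_are_median_vs_mean(stats.norm.pdf))      # 2/pi = 0.6366, as in (5.5)
    print(pitman_are_median_vs_mean(stats.laplace.pdf))   # 2.0, as in (5.4)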

8 Asymptotic Relative Efficiencies of One-Sample


Procedures for Shift Families
Now that we have a definition of asymptotic relative efficiency for tests,
point estimators, and confidence procedures, and a convenient formula for
its computation, we can find the actual value of the asymptotic relative

efficiency for some procedures in various situations. In this section we con-


sider one-sample procedures in the situation of a large class of densities which
satisfy the shift model.

8.1 The Shift Model

A family of densities f(x; θ) is called a shift family if

    f(x; θ) = h(x − θ)   (8.1)

for some specified density function h. For example, if h is normal with mean
zero and variance σ², then f(x; θ) is normal with mean θ and variance σ².
If h(x) = e^{−|x|/λ}/(2λ) with λ specified, then f(x; θ) is the Laplace density given
in Equation (5.1).
In general, f is the same as h except for a shift by the amount θ. The shift
is to the right if θ > 0 and to the left if θ < 0. The density h can have any
number of parameters, but the shift model is a one-parameter family in-
dexed by θ, with any other parameters fixed. In other words, θ is the parameter
of interest, and any other parameters are "nuisance" parameters.
In hypothesis testing, if the density h of a shift model satisfies the null
hypothesis, then θ = 0 belongs to the null hypothesis. The alternative
θ > 0, for instance, says that the population has the density h but shifted
to the right by the amount θ. Hence the shift model is frequently referred to
as the shift alternative for hypothesis testing.
The procedures which we shall study below under the shift model are
all shift families of procedures. That is, each test has associated with it,
for each θ₀, a test performed by subtracting θ₀ from every observation and
then applying the original test. The null hypothesis for the second test
is the same as that for the first except for a shift by the amount θ₀. Under a
shift model, if θ = 0 satisfies the first null hypothesis, then θ = θ₀ satisfies the
second. Similarly, the estimators and confidence bounds will increase by the
amount k if every observation is increased by k. Consequently, the asymptotic
efficacies and relative efficiencies which we calculate will not depend on θ,
and will apply to any value θ₀ (Problem 34; see also Problem 21d).
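As a small illustration (not from the text; it assumes numpy and scipy), the sign test for an arbitrary θ₀ in a shift family can be carried out exactly as described: subtract θ₀ from every observation and apply the original θ = 0 test.

    import numpy as np
    from scipy import stats

    def sign_test_pvalue(x, theta0=0.0):
        """One-tailed sign test of theta = theta0 against theta > theta0."""
        shifted = np.asarray(x) - theta0        # reduce to the theta = 0 problem
        n_above = int(np.sum(shifted > 0))
        # P(Bin(n, 1/2) >= n_above) under the null hypothesis
        return stats.binom.sf(n_above - 1, shifted.size, 0.5)

    x = [1.2, -0.4, 2.3, 0.8, 1.9, 0.1, 1.4, -0.2]       # illustrative data
    print(sign_test_pvalue(x, theta0=0.0), sign_test_pvalue(x, theta0=1.0))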

8.2 Asymptotic Relative Efficiencies for Specific Shift Families

In this subsection, we obtain the asymptotic efficacies and thereby the


asymptotic relative efficiencies of various one-sample procedures for some
specific families of densities of the shift form. These include the normal,
Laplace (double exponential), logistic, Cauchy, and uniform densities and
the more general symmetric beta and double exponential power families.
The procedures to be considered, tests, estimators and confidence bounds,
are the normal-theory procedures, the median procedures, the Wilcoxon

signed-rank procedures, the general procedures based on sums of signed


constants, including the squared ranks and normal scores procedures, and
procedures based on the randomization distribution of the sample mean.
We do not consider the Walsh procedures, described in Sect. 7.4 of Chap. 3,
because there is no general rule for constructing them for arbitrary sample
sizes. They are designed ad hoc for sample sizes of 15 or less and hence their
asymptotic efficiency is undefined. Kolmogorov-Smirnov procedures are
also omitted for now, since their asymptotic behavior is fundamentally
different and neither Pitman's formula nor the usual properties of asymp-
totic relative efficiency apply to them. They will be treated in Sect. 11.

Procedures Considered and their Asymptotic Efficacies for Shifts

We start by listing the procedures we consider and formulas for their asymp-
totic efficacies for any shift family h. Thereafter we give numerical values for
the shift families just mentioned.
(1) The normal-theory procedures include the t test for the null hypothesis
that the population mean has a specified value (or the test based on the
sample mean if the variance is assumed known), the sample mean as estimator
of the population mean, and the confidence bounds for the population mean
corresponding to the foregoing tests.
The efficacy is most conveniently computed from the estimator, the
sample mean, and is e_{1,n} = n/σ², where σ² is the population variance, the
variance of the density h (Problem 35a). Thus the asymptotic efficacy is

    e_{1·} = 1/σ².   (8.2)

(2) The median procedures (in the sense of Sect. 5) are the sign test for the
null hypothesis that the population median has a specified value, the sample
median as estimator of the population median, and the corresponding
confidence bounds for the population median which have appropriate order
statistics as endpoints.
The efficacy is conveniently computed from the test statistic, the number
of observations smaller than the median value specified by the null hypo-
thesis, and is 4nh²(0) if h is a density with median 0 (Problem 35b). The
asymptotic efficacy is then

    e_{2·} = 4h²(0).   (8.3)

(3) The Wilcoxon procedures are the Wilcoxon signed-rank test for the
null hypothesis that the population is symmetric about a specified value, the
corresponding estimator of the population center of symmetry, namely the
median of the set of Walsh averages (see Sect. 3.2), and the corresponding
confidence bounds (Sect. 4, Chap. 3).

The Wilcoxon procedures are not valid in general, even asymptotically,
unless h is symmetric about zero. If h is symmetric about zero, the efficacy
is most easily computed from the test statistic, and is (Problem 35c)

    e_{3,n} = [2n(n − 1)∫h²(x) dx + 2nh(0)]² / [n(n + 1)(2n + 1)/6].   (8.4)

The asymptotic efficacy is therefore

    e_{3·} = 12[∫h²(x) dx]².   (8.5)

(4) The procedures based on sums of signed constants (Sect. 7, Chap. 3)
use, for each n, a set of n constants c_{n1}, c_{n2}, …, c_{nn}, where the first subscript
n is now included to designate explicitly the dependence of the procedures on
n. The test statistic is Σ_{j=1}^n c_{nj}S_j, where S_j = ±1 and the sign is the sign of
the jth smallest observation in order of absolute value. The relevant pro-
cedures are this test and the corresponding estimator and confidence pro-
cedures for the population center of symmetry. (The estimator is the amount
which must be subtracted from every observation to make the test statistic
have the value zero.)
These procedures are generally valid only if h is symmetric about zero.
It will be proved at the end of this section that if h is symmetric about zero
and sufficiently regular, then the efficacy may be expressed as

    e_{4,n} = {Σ_{j=1}^n c_{nj} E_h[h′(|X|_{(j)})/h(|X|_{(j)})]}² / Σ_{j=1}^n c²_{nj},   (8.6)

where, in the expectation, I X 1(1) < ... < I X I(n) are the absolute values of a
sample of n observations from the density h arranged in increasing order of
size. (This notation is imprecise, since the meaning and distribution of
IX Iw depend on n as well as j.) The asymptotic efficacy is lim e4, n/n, and
will exist only under some restriction on the Cnj.
A common situation is that

    c_{nj} = b_n c((j − ½)/n) + remainder,   (8.7)

where b_n is a constant for each n, c(u) is a function of u defined for 0 < u < 1,
and the remainders are small enough not to contribute asymptotically. For
example, the sign test statistic has this form with c_{nj} = 1, c(u) = 1, and b_n = 1.
The Wilcoxon signed-rank test statistic has this form with c_{nj} = j, c(u) = u,
and b_n = n. The factor b_n has no effect on the test and is included in (8.7)
for the convenience of using the test statistics as previously defined. For
example, without b_n the Wilcoxon signed-rank test would have c_{nj} = j/n,
an equivalent but inconvenient form. The reason for evaluating c(u) at (j − ½)/n
rather than j/n in (8.7) is to allow the possibility that c(u) → ∞ as u → 1, in
which case c(j/n) could not be used for j = n.

Under suitable conditions on c and the remainder terms, if h is symmetric
about zero and sufficiently regular, then we will show at the end of this
section that the asymptotic efficacy when (8.7) holds is

    e_{4·} = 4{∫_{1/2}^1 c(2u − 1) h′[H⁻¹(u)]/h[H⁻¹(u)] du}² / ∫_0^1 c²(u) du
          = 4{∫_0^∞ c[2H(x) − 1] h′(x) dx}² / ∫_0^1 c²(u) du,   (8.8)

where H is the c.d.f. of the density h and H⁻¹ is its inverse. Alternative
expressions appear in Problem 36.
One density h to which (8.6) and (8.8) do not apply is the uniform, for
which, if the range is [−1, 1] (Problem 37),

    (8.9)

When (8.7) holds with suitable conditions on c, the asymptotic efficacy for
the uniform density is (Problem 37)

    e_{4·} = c²(1) / ∫_0^1 c²(u) du.   (8.10)
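As a numerical sanity check (not from the text; it assumes scipy is available), evaluating (8.8) with the Wilcoxon scores c(u) = u should reproduce the Wilcoxon asymptotic efficacy 12[∫h²(x) dx]² of (8.5); the sketch below does this for a standard normal h.

    import numpy as np
    from scipy import stats
    from scipy.integrate import quad

    h, H = stats.norm.pdf, stats.norm.cdf
    h_prime = lambda x: -x * h(x)          # derivative of the standard normal density
    c = lambda u: u                        # Wilcoxon signed-rank scores

    numer = quad(lambda x: c(2 * H(x) - 1) * h_prime(x), 0, np.inf)[0]
    denom = quad(lambda u: c(u)**2, 0, 1)[0]
    e_from_88 = 4 * numer**2 / denom                                   # formula (8.8)

    e_from_85 = 12 * quad(lambda x: h(x)**2, -np.inf, np.inf)[0]**2    # formula (8.5)
    print(e_from_88, e_from_85)            # both about 3/pi = 0.955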

(5) The normal scores procedures are of the foregoing type and satisfy
(8.7) with c(u) = Φ⁻¹[(1 + u)/2], where Φ is the standard normal c.d.f.
For example, we might take c_{nj} as the quantile of order (j − ½)/n or j/(n + 1) of
the absolute value of a standard normal random variable, that is, the
(n + j − ½)/2n or (n + j + 1)/2(n + 1) quantile of the standard normal dis-
tribution, or take c_{nj} = E[|Z|_{(j)}] where |Z|_{(1)} < ⋯ < |Z|_{(n)} are the ordered
absolute values of a sample of n from the standard normal distribution (Sect.
9, Chap. 3).
When h is symmetric about zero and sufficiently regular, the asymptotic
efficacy of any normal scores procedure can be expressed as (Problem 38)

    e_{5·} = 4{∫_0^∞ Φ⁻¹[H(x)]h′(x) dx}² = {∫_{−∞}^{∞} Φ⁻¹′[H(x)]h²(x) dx}²
          = {∫_0^1 Φ⁻¹′(t)h[H⁻¹(t)] dt}²,   (8.11)

where H is the c.d.f. of h, Φ⁻¹ is the inverse of the standard normal distribu-
tion, and Φ⁻¹′ is the derivative of this inverse function.
(6) The procedures based on the randomization distribution are the random-
ization test of the null hypothesis of symmetry around a specified value and
the corresponding estimator and confidence procedures, as described in
Sects. 2.1 and 2.2 of Chap. 4. With the sample mean as the test statistic, the
efficacy and limiting efficacy per observation are the same as for the normal-
theory procedures. However, further argument is needed to justify this

method of calculation in the case of the randomization procedures. Since


the test compares the sample mean with a quantile of its conditional dis-
tribution given the absolute values of the observations, the critical value of
the sample mean is not fixed but depends on the sample in a complicated
way - more complicated than substituting an estimate for a nuisance
parameter and adjusting a constant as in Sect. 2.2.

Asymptotic Relative Efficiencies for Specific Shift Families

The asymptotic efficiency of any one of the foregoing listed procedures


relative to any other is now easily obtained by dividing the asymptotic
efficacy of the first procedure by that of the second. The resulting asymptotic
relative efficiency applies, of course, to the particular shift family specified
by the density h, and the two procedures must both be valid for this family,
at least asymptotically. If h is symmetric about zero, all the foregoing pro-
cedures are asymptotically valid. Otherwise, the circumstances and popula-
tion parameters for which they are valid vary from one procedure to another.
For this reason, all the examples below use densities h which are symmetric
about zero.
Unless we are confident of symmetry, the first question in choosing a
procedure is not efficiency but whether the parameter of interest is really the
mean, the median, or something else, as discussed earlier (Sect. 3.2, Sect. 5).
We also point out that a change of scale, such as a change of standard
deviation for the normal distribution, multiplies the efficacies of each of the
foregoing procedures by the same factor, and therefore leaves their asymptotic
relative efficiencies unchanged. Specifically, if h(x) is symmetric about zero,
replacing h(x) by ah(ax) for some a > 0 multiplies all the foregoing efficacies
by a². This can be seen by direct calculation or by consideration of what
happens when all observations are divided by the value a (Problem 39).
Consequently, the asymptotic efficiency of one procedure relative to another
obtained for a particular density h applies also to any rescaling of h, that is,
also to ah(ax) for all a > 0.
Table 8.1 gives the numerical values of the asymptotic efficacies of the
foregoing procedures for the shift families generated by various symmetric
densities h. The asymptotic relative efficiencies can be found simply by
taking quotients. We now explain this table.
The column labeled "Mean" in Table 8.1 applies to procedures discussed
in (1) and (6) earlier, namely, the normal-theory procedures, including those
based on the t distribution, and the randomization procedures based on
the sample mean. The columns headed "Median," "Wilcoxon," and "Normal
scores" apply to the procedures so designated earlier (Numbers (2), (3),
and (5)). These three procedures belong to the class of procedures based on
sums of signed constants, discussed in Number (4), for the constants c_{nj} = 1, j,
and normal scores respectively.
Table 8.1 Asymptotic Efficacies and Asymptotic Relative Efficiencies of Some One-Sample Procedures for Some Shift Families

                                                          Procedure
Density h(x)                         Parameter          Mean   Median  Wilcoxon  Squared ranks  Normal scores  Asymptotically efficient

ce^(-|ax|^k)                         k = 0.5             0.033  1.00    0.19      0.084          0.11           ∞
  (double exponential power),        k = 0.75            0.21   1.00    0.47      0.29           0.39           1.21
  c = ak/[2Γ(1/k)],                  k = 1 (Laplace)     0.50   1.00    0.75      0.56           0.64           1.00
  a = Γ(1/k)/k = Γ(1 + 1/k)          k = 1.5             1.10   1.00    1.19      1.08           1.14           1.21
                                     k = 2 (Normal)      1.57   1.00    1.50      1.54           1.57           1.57
                                     k = 4               2.43   1.00    2.12      2.67           2.75           3.33
                                     k = 10              2.88   1.00    2.61      3.84           4.41           9.15

                                     k = ∞, r = 1
                                       (uniform)         3.00   1.00    3.00      5.00           ∞              ∞

c(1 − a²x²)^(r−1), |x| ≤ 1/a         r = 1.5             2.47   1.00    2.16      2.90           4.11           ∞
  (symmetric beta),                  r = 2               2.22   1.00    1.92      2.36           2.82           ∞
  c = a/[2^(2r−1)B(r, r)],           r = 3               1.99   1.00    1.74      2.00           2.18           2.88
  a = 4^(r−1)B(r, r)                 r = 4               1.88   1.00    1.67      1.85           1.97           2.20
                                     r = ∞ (Normal)      1.57   1.00    1.50      1.54           1.57           1.57

ae^(ax)/(e^(ax) + 1)²                a = 2 (logistic)    1.22   1.00    1.33      1.25           1.27           1.33

a/[π(1 + a²x²)]                      a = π/2 (Cauchy)    0      1.00    0.75      0.44           0.53           1.23

Note: Entries are asymptotic efficacies for the values of a specified. They are asymptotic efficiencies relative to the median procedures, and their ratios are asymptotic relative efficiencies for all values of a.

The column headed "Squared ranks" applies to procedures based on sums of
signed constants for the constants c_{nj} = j². The last column, headed
"Asymptotically efficient," will be explained shortly.
Each row applies to the particular density h which is indicated and gives
asymptotic efficacies for the value of a specified. Thus, for instance, the
third row gives the asymptotic efficacies of the procedures just listed for
the Laplace density h(x) = e^{−|x|}/2, that is, for the Laplace family f(x; θ) =
h(x − θ) = e^{−|x−θ|}/2. The fifth row gives asymptotic efficacies for the normal
density h(x) = e^{−[Γ(1.5)x]²}/2 = e^{−πx²/4}/2, which is not standard normal but
has variance 2/π. For each density h in the table, the scale has been chosen
in such a way that h(0) = ½ and hence, by (8.3), that the asymptotic efficacy
of the median procedures is one.
The asymptotic relative efficiency of any two procedures for any density
h in the table is simply the quotient of the appropriate two entries in the row
corresponding to h. In particular, the individual entries are asymptotic
efficiencies relative to the median procedures. As explained before, the choice
of scale affects the asymptotic efficacies but not the asymptotic relative
efficiencies, which are therefore valid for all values of a. Thus, for instance,
the quotients of the efficacies given in the fifth row are asymptotic relative
efficiencies for a family of normal densities with any variance.
Densities of the first set in Table 8.1 are double exponential power densi-
ties, as in Equation (5.6) with λ = 1/a. The cases k = 1 and k = 2 are the
Laplace and normal distributions already mentioned. The uniform distribu-
tion is approached as k → ∞. In particular, for the value of a specified, the
density approaches the uniform density on the interval (−1, 1) (Problem
40).
The densities of the second set are symmetric beta densities shifted so
that they are centered at 0 rather than at ½ and rescaled. For r = 1, we have
the uniform density, just as for k → ∞ above. The two cases are accordingly
combined in the table. The normal distribution is approached as r → ∞
(Problem 41). Hence, it appears in two places in Table 8.1.
The logistic density was defined (without a scale parameter) in Equation
(9.11) of Chap. 3. The logistic family f(x; θ) = h(x − θ) used in Table 8.1
is the same as the two-parameter form given in Equation (7.2) of Chap. 5 with
μ = θ, σ = 1/a. Alternative expressions in terms of cosh z = (e^z + e^{−z})/2
appear in Problem 42.
The right-hand column of Table 8.1 gives the Fisher information (per
observation) for each family f(x; θ) = h(x − θ) for the value of a specified.
The Fisher information is defined as I(θ) = E_θ[∂ log f(X; θ)/∂θ]², where X has
density f(x; θ), and is in general a function of θ. For the shift family it becomes
the constant

    I = E_h[h′(X)/h(X)]²   (8.12)

where X has density h(x) (Problem 43). The Cramér-Rao inequality (see
also Sect. 2.1 and Problem 2) states that, under certain regularity conditions,

the variance of any estimator is at least [1 + b'«()] 2 /n/«() where b is the


bias of the estimator of lJ and b' is its derivative. Since 1 + b'(lJ) = ,1'(0),
this is equivalent to saying that the efficacy per observation cannot exceed
J«(). Hence the same is true of the asymptotic efficacy, and accordingly the
entry in the last column of each row of Table 8.1 is at least as large as every
other entry in that row. A procedure with asymptotic efficacy equal to the
Fisher information in a particular situation is said to be asymptotically
efficient in that situation, provided the properties of asymptotic efficacy we
have given, such as (2.11), hold. (These properties preclude the phenomenon
of "superefficiency" illustrated in Problem 44, and hence so must the
regularity conditions which would be required for rigorous proofs of these
properties.)
For any "well-behaved" one-parameter family of densities f(x; (J),
asymptotically efficient procedures exist. One is maximum likelihood
estimation. The usual tests and confidence procedures based on maximum
likelihood are ordinarily invalid outside the family in question, but ran-
domization procedures based on the maximum likelihood estimator for a
symmetric shift-parameter family are valid for all symmetric distributions
and have the same asymptotic efficacy and hence efficiency (although the
properties of efficacy require special justification for randomization pro-
cedures, as discussed at (6) earlier in the normal-theory case). Another
asymptotically efficient procedure for a symmetric shift-parameter family
is the locally most powerful signed-rank test for that family, as discussed in
Sect. 8.4; this test and the corresponding confidence procedures are also
valid for all symmetric distributions. Note, however, that in either case the
procedure depends on h, and the final column of Table 8.1 does not represent
any single procedure.
For a given family f(x; θ) = h(x − θ), the asymptotic efficiency of any
procedure of Table 8.1 relative to a procedure which is asymptotically
efficient for that family can be found as the ratio of the appropriate entry in
Table 8.1 to the last entry in the same row, but the asymptotic efficiency
relative to a procedure which is asymptotically efficient for a different
family is not generally available from this table. As before, results on asymp-
totic efficiency are not affected by a change of scale.
The median, Wilcoxon, and normal-scores procedures correspond to
locally most powerful signed-rank tests for the Laplace, logistic, and normal
distributions respectively, as we shall see in Sect. 8.4. Each is, therefore,
asymptotically efficient for its respective distribution and has efficacy equal
to the Fisher information in the appropriate row of Table 8.1.
The asymptotic efficiencies developed in the example of Sect. 5 for the
median procedures relative to the mean (normal-theory) procedures can be
verified from Table 8.1. For the Laplace family, this asymptotic relative
efficiency is 1.00/0.50 = 2.00, in agreement with (5.4). For the normal
family, 1.00/1.57 = 0.64 agrees with (5.5), and for the uniform family, 1.00/3.00

agrees with (5.8). The graph in Fig. 5.1 includes these values and several
others which can be verified similarly.
This concludes our explanation of Table 8.1. The numbers in the table
can all be obtained from Equations (8.2), (8.3), (8.5) and (8.8)-(8.11). These
calculations, except those which require numerical integration, are left to
the reader as Problems 45-52.

*Derivations
We will first derive formula (8.6) for the efficacy of the procedures based on
sums of signed constants Cnj' and then formula (8.8) for the asymptotic
efficacy. We assume that h is symmetric about zero and sufficiently regular
to justify the steps to follow.
The test statistic for the null hypothesis 0 = 0 is
n

1',. = L SjCn)
j= 1
(8.13)

where S) = 1 or - 1 according as the jth observation in order of absolute


value is positive or negative. Under the null hypothesis 0 = 0, the Sj are
independent and identically distributed (Chap. 3) with
P(Sj = 1) = P(Sj = -1) = t, (8.14)
and therefore 1',. has null mean and variance (Problem 80, Chap. 3)
n
flnCO) = Eo(1',.) = 0, 0';(0) = L c~. (8.15)
j= I

The remaining problem is to evaluate

, din d
= j~1 Cnj dO EIJ(S)
flnCO) = dO Eo(1',.) 0=0
I 0=0'
(8.16)

Let I X 1(1)' ••• , I X I(n) be the absolute values of the observations ordered
from smallest to largest. Given IXI(I)"'" IXI(n), the conditional probability
that Sj = 1 is (Problem 53)

I J(I X I(j); 0) (8 )
Po(Sj = I I X 1(1)' ... , IX I(n) = J(IX I(j); 0) + J( _I X I(}); 0)' .17

Let J*(ZI"'" Zn; 0) be the joint density of IXI(I)"'" IXI(n) at ZI"'" ZII'
Then
Eo(S) = Po(Sj = 1) - PIJ(Sj = -1) = 2Po(Sj = 1) - 1
= 2Eo[P o(Sj = lII X I(1)'"'' IXI(n)] - 1
388 8 Asymptotic Relative Efficiency

Since f(z; 0) = h(z - 0) and h is symmetric about 0, the derivative of (8.18)


at 0 = 0 (provided it is legitimate to differentiate under the integral sign) is

~ Eo(8)j
dO I -
0;0 - -
Ih'(Z)
h(z) f* (zl>""zn,O)dzl···dzn
.

+ :0 I f*(zl' ... , Zn; 0) dZ I ... dZn Lo


- Eo[h'( IX I(j)/h( IX lu))]' (8.19)
Substituting this result in (8.16), and then (8.16) and (8.15) in Pitman's
Formula (7.1) gives the efficacy in (8.6).
Having the efficacy (8.6), we now seek a formula for the asymptotic
efficacy under assumption (8.7). The denominator of (8.6) is

jt/;j = b; J/2C ~ t) + remainder


= nb; f c 2 (u) du + remainder, (8.20)

since the integral can be approximated by the sum above it divided by n.


The numerator of (8.6) is a little more complicated to handle, but the same
idea can be used in the following manner. When the observations have density
h, their absolute values have c.dJ. 2H - 1 on x > 0. Then, for n large, the
probability is high that IX I(j) will be close to the pth quantile of this c.dJ.,
where p = j/n or U - t)/n or anything else which is asymptotically equivalent.
Since the p-point of2H - 1 is H-I(t + !p), the bracketed quantity in (8.6) is

_1) h'{H-I[l + 1U -l)/n]}


jL c
n (.
=b ~ 2 2 2 + remainder
n ;1 n h{H- 1 [t + tu - t)/n]}

= 2nbn I I
1/2 c(2u
h,[H - I(U)]
- 1) h[H I(U)] du + remamder.
.
(8.21)

Dividing the square of this by n times the quantity (8.20) and letting n -+ w
gives the first formula of (8.8). The second formula follows immediately. *

8.3 Bounds on Asymptotic Relative Efficiencies for Shift Families

We now turn to an investigation of bounds on asymptotic relative efficiencies.


As before the discussion will be limited to shift families of densities f(x; 0) =
h(x - 0). We are mainly concerned with symmetnc, unimodal densities
h. However, all we require, unless further assumptions are specifically
8 AsymptotIc RelatIve Efficiencies of One-Sample Procedures for Shift FamIlies 389

stated, is that h be such that those procedures under discussion are valid, at
least asymptotically.
We consider first the asymptotic efficiency of the normal-theory (mean)
procedures relative to the median procedures. Notice that its value in Table
8.1 ranges from zero for the Cauchy distribution to 3 for the uniform dis-
tribution. (Its value is zero for any density h which is positive at its median
and has infinite variance.) As it happens, the largest value possible for any
shift family given by a density h which has its maximum at its median is 3,
although an asymptotic efficiency of infinity is possible for a symmetric,
multimodal density h (Problem 54). Thus, forj(x; (J) = h(x - (J), the asymp-
totic efficiency of the normal-theory procedures with respect to the median
procedures can be zero, even if h is required to be symmetric and unimodal;
it can be infinite if h is unrestricted or merely assumed symmetric; and it can
be as large as 3, but no larger, if h has its maximum at its median or, therefore,
if h is symmetric and unimodal (because this implies that h has its maximum
at its median).
Next we consider the asymptotic efficiency of the Wilcoxon procedures
relative to the normal-theory procedures. It can be infinite even if h is sym-
metric and unimodal (the Cauchy family again provides an example), and
it can be as small as but no smaller than 108/125 = 0.864 if h is symmetric,
whether or not it is also required to be unimodal. The value 0.864 is achieved
by the symmetric beta density with,. = 2. This density is a quadratic function
with a negative squared term, except that where the quadratic function is
negative the density is zero. A proof that 0.864 is minimum appears at the
end of this subsection under (1).
The asymptotic efficiency of the Wilcoxon procedures relative to the
median procedures can be arbitrarily small, even if h is symmetric and uni-
modal; it can be infinite if h is unrestricted or merely symmetric; and it can
be as large as 3 but not larger if h has its maximum at its median or is sym-
metric and unimodal (Problem 55). The uniform distribution gives the value
3, while the double exponential power family in Equation (5.6) gives values
which approach zero as k ---4 0 (Problem 55).
The normal scores procedures are at least as efficient asymptotically as the
normal-theory procedures for all symmetric h and may be infinitely more
so. The asymptotic efficiency of the normal scores procedures relative to the
normal-theory procedures is infinite for h either Cauchy or uniform; it is
one if h is normal; and it is never less than one if h is symmetric. These results
are proved later under (2). Clearly, a requirement of unimodality would not
improve the bounds.
The asymptotic efficiency of the normal scores procedures relative to the
median procedures can be infinite and it can be arbitrarily small even if h
is symmetric and unimodal. The uniform distribution provides an example
of the former, and Problem 56 of the latter.
The asymptotic efficiency of the normal scores procedures relative to the
Wilcoxon procedures can be infinite, even if h is symmetric and unimodal,
390 8 Asymptotic Relative Efficiency

Table 8.2 Bounds on Asymptotic Relative Efficiencies of One-Sample


Procedures for Shift of a Symmetric Density
Normal
Mean Median Wilcoxon scores

Mean 1 1 0 3" 0 1..li 0 I


108

Median I"
3 00 I 1 I"
3 CIJ 0 00

Wilcoxon 1Q.!!
0 3" 1 I 0 J!
125 00

n
Normal scores 1 00 0 00 "6 00 I I

a If ummodahty IS not assumed, ! must be replaced by 0 and 3 by 00.

and it must be more than but can come arbitrarily close to n/6 = 0.524
if h is symmetric, whether or not h is also required to be unimodal. The
uniform distribution again provides an example of the former, and the
proof of the latter is requested in Problem 57.
Table 8.2 summarizes most of the foregoing bounds. Each cell relates to
the procedures designating its row and column and contains the greatest
lower bound and the least upper bound on the asymptotic efficiency of the
row procedure relative to the column procedure for shifts of a symmetric,
unimodal density. If unimodality is not assumed, the bounds are the same
except in the cases footnoted. Notice that the table is symmetric, in the sense
that, for instance, the greatest lower bound for the median relative to the mean
is 1, while the least upper bound for the mean relative to the median is the
reciprocal of 1 or 3.

*Proofs
(1) What is the minimum asymptotic efficiency of the Wilcoxon pro-
cedures relative to the normal theory procedures for shifts of a density
h which is symmetric about zero? By (8.4) and (8.6), the quantity to be


minimized is

£3;1 = 12a 2 [f h (x) dx


2 (8.22)

Since the scale has no effect, we can fix

a2 = f x 2 h(x) dx, (8.23)

and then our problem is to minimize Jh2(X) dx subject to the condition


(8.23). For any density h satisfying this condition, we have

f h2(X) dx = f h(x)[h(x) - (a - bx 2)] dx +a- ba 2 (8.24)


8 Asymptotic Relative Efficiencies of One-Sample Procedures for Shift Families 391

for any values of the "undetermined multipliers" a, b. The integrand on the


right-hand side of (8.24) is minimized over all nonnegative h(x) by ha.b(x) =
[a - bx 2 ] + /2, and hence so is the entire expression. We now choose a, b so
that ha • b is a density satisfying (8.23), as it is possible to do. For this a, b, the
function ha b still minimizes the right-hand side of (8.24) among nonnegative
functions ~nd a fortiori among densities satisfying (8.23). For this ha• b the
asymptotic relative efficiency (8.22) is easily calculated as 108/125. Since
ha.h has the symmetric beta form of Table 8.1 with r = 2, this value is already
available from the table. This result was given by Hodges and Lehmann
[1956].
(2) We shall now minimize the asymptotic efficiency of the normal-scores
procedures relative to the normal theory procedures for shifts of a density h
which is symmetric about zero. The asymptotic relative efficiency by (8.2)
and (8.11) is
(8.25)
where

(8.26)

The trivial result


z ::2: 2 - (l/z) for z > 0 (8.27)
applied to (8.26) with
z = <I>-',[H(x)]h(x) = h(x)/4>{<I>-'[H(x)]} (8.28)
gives

Q::2: 2 - Looooc/>{<I>-'[H(X)]} dx. (8.29)

Integrating by parts then gives

Q ::2: 2 - xc/>{<I>-'[H(x)]} [00 - f x<l>-'[H(x)]h(x) dx. (8.30)

The next-to-Iast term of Equation (8.30) vanishes if (J2 < 00 (Problem 58a);
then applying the Schwarz inequality (or the fact that a correlation is at most
one) to the last term gives

Q::2: 2 - ((J2 f{<I>-'[H(X)]fh(X)dX),/2. (8.31)

The integral in Equation (8.31) is equal to one (Problem 58b), and we may
assume without loss of generality that (J2 = 1 since we can rescale H if neces-
sary. Then Q ::2: 1 and
(8.32)
392 8 AsymptotIc Relatlvc Efficlcncy

equality holds if and only if H = <1>. Therefore E5: 1 ;:::: 1 with equality if and
only if h is normal. 0

Proof (2) was adapted from Chernoff and Savage [1958] and is similar
to that of Gastwirth and Wolff [1968] (Problem 59). The reader may wish
to consult these papers for further insight.

*8.4 Asymptotically Efficient Signed-Rank Procedures

In this subsection we will give explicitly the locally most powerful signed-
rank test for a symmetric shift alternative, show that its asymptotic efficacy
equals the Fisher information and hence that the test and related estimators
and confidence procedures are asymptotically efficient for the specified
alternative, and discuss the possibility of an "adaptive" procedure which
would be asymptotically efficient for all symmetric shifts simultaneously.
The signed-rank test which is locally most powerful for a shift family
of densities f(x; (J) :, h(x - (J) with h symmetric about zero is, by (9.8). of
Chap. 3, based on the sum of signed constants
(8.33)
where the IX I(J) are as usual the ordered absolute values of a sample from the
popUlation with distributIOn h (Problem 60). (The meaning of h is different
in Chap. 3.) The efficacy of this procedure for the family f(x; 0) = h(x - 0)
is, by (8.6), equal to

(8.34)

It can be shown (see below) that the asymptotic efficacy of this procedure is

(8.35)

where I is the Fisher information given in (8.12) for the family f(x; 0) =
h(x - (J). As discussed earlier (see Sect. 8.2), no procedure satisfying suitable
regularity conditions has asymptotic efficacy greater than the Fisher in-
formation. Therefore the locally most powerful signed-rank procedure for
the family f(x; (J) = h(x - (J) is asymptotically efficient for this family,
and it has asymptotic efficiency at least one relative to every other procedure
satisfying the regularity conditions. This applies, of course, to the corres-
ponding estimators and confidence procedures as well as tests. Exact regu-
larity conditions which are also simple are hard to find, but the foregoing
statements certainly apply to most standard non parametric and parametric
procedures.
We now have an asymptotically efficient procedure for any given h. Is
there a procedure which is asymptotically efficient for all h simultaneously?
8 Asymptotic Relative Efficiencies of One-Sample Procedures for Shift Fanuhes 393

Apparently so, if we first estimate h and then use a procedure which is


asymptotically efficient for the estimated h. Two difficulties arise here,
namely, how to estimate the density h, and how the preliminary estimation
affects the overall properties ofthe procedure, such as the level and power of a
test. Suppose that we always use a test of 0 = 0 which has conditional level
IX given the absolute values 1Xli, ... ,IX n I. Then the test chosen may depend
on the absolute values in any way we like without affecting the level IX. We
can take advantage of this by using the absolute values to estimate h. If the
observations have a density h which is symmetric about zero for all x, then
their absolute values have density 2h for x > O. Suppose we estimate h from
the absolute values as if they had density 2h, and then find the locally most
powerful signed-rank test for the estimated h at level IX. The overall level will
be IX. Under the alternativeJ(x; (J) = h(x - (J), the absolute values will have
density h(x - (J) + h(x + (J), so we will be treating an estimate of this as if
it were an estimate of 2h(x) in choosing our test. Presumably, this will make
little difference if 1(J 1 is small, while otherwise the power will be close to one
in large samples anyway. Our procedure will therefore have asymptotically
as good power everywhere as the locally most powerful signed-rank test
for the h which actually obtains, whatever it may be. Thus we have obtained
a test which is asymptotically efficient for all h simultaneously. The corres-
ponding estimators and confidence procedures would be even more com-
plicated to compute but would also be asymptotically efficient.
It sounds marvelous to have a procedure which is asymptotically efficient
for all h simultaneously, but there are severe limitations. The density h
cannot be estimated at all without some assumption about its smoothness.
Even under such an assumption, it cannot be estimated reliably except in
very large samples. The density is especially hard to estimate in the tails,
which have a strong influence on the procedure to be used. It appears,
therefore, that the sample size must be extremely large before the simul-
taneous asymptotic efficiency of the foregoing procedure for all h will really
make itselffelt. Aside from this, it should be noted that the efficiency property
applies only to shift parameter families. The foregoing procedure will not
ordinarily be asymptotically efficient for a family not of the form f(x; (J) =
h(x - (J). For example, the locally most powerful signed-rank test for such a
family will ordinarily be asymptotically more efficient. Presumably, no
procedure is asymptotically efficient for all families J(x; (J) simultaneously,
as the development of an asymptotically efficient procedure depends on
making some assumption about the way in which the alternatives" close to"
any particular null distribution approach that distribution.
These and other procedures using some sample information to choose
among more "elementary," conditionally valid procedures have come to be
called "adaptive." See also Stein [1956J, Hogg [1974J, and Sect. 3.2 of
Chap. 4.
PROOF. The asymptotic efficacy of the locally most powerful signed-rank
test given by (8.33) can be obtained by the argument used at (8.21). Since
394 8 Asymptotic Relative Efficiency

Cnj is now equal to the expectation multiplying it in (8.21), this argument gives,
for the efficacy (8.34),
n 2 fl [h'(H- I(U)]]2 .
j~1 Cn) = 2n J1/2 h[H leU)] du + remamder
= 2n L'x'[~~n\(X) dx + remainder
hl(X)] 2 •
= nEh [ heX) + remamder. (8.36)

Formula (8.35) for the asymptotic efficacy follows. An essentially equivalent


proof can be based on (8.7) and (8.8) (Problem 61).
Still another proof can be based on the following expression for the
efficacy (Problem 62), which also provides some insight in itself:

i>2 = i Eh[hl(II
j = 1 n) ) =1 h(
X I{J)] 2
X lu)
- varh[h l( IX I(j)]
h( IX I(j)

= E [hl(X)]2 ~ [hl( IX I(j)] (8.37)


n h heX) ,L.J
)=1
varh h(IXI (j) ) .

The last term here represents the amount by which the efficacy falls short
of the information in n observations.* 0

9 Asymptotic Relative Efficiency of Procedures for


Matched Pairs
This section concerns observations which occur as matched pairs. As usual,
we let Xj be the treatment-control difference in the jth pair and apply one-
sample procedures and results to the n differences, X l ' ... , X n • Thus the
relevant distribution is now the distribution of the treatment-control
differences. This affects the assumptions it is natural to make and hence
the resulting consequences, as we will now discuss. The earlier discussions
of matched pairs in Sect. 7 of Chap. 2 and Sect. 2 of Chap. 3 should be kept
in mind and will not be repeated, although some overlap is inevitable.

9.1 Assumptions

To be specific, let the jth pair consist of a "control" measurement lj and a


"treatment" measurement ltj. Then if Xj = ltj - lj, asymptotic relative effi-
ciencies (and asymptotic power, etc.) can be obtained exactly as before
9 Asymptotic Relative Efficiency of Procedures fOf Matched Palfs 395

from the distribution of the Xj' In particular, the appropriate probability


distribution f(x; 0) to use in Sects. 5 and 8 is that of Xj = Hj - ~,not that
of Hj or Vj individually.
The question then is what assumptions the distribution of Xj might
satisfy. Any family of densities f(x; e) can arise from some family of joint dis-
tributions of (Vj, Uj), and any shift family f(x; 0) = hex - 0) arises from
what might naturally be called a shift family of joint distributions of (~, Uj)
(Problems 63a,b). The consequences of other assumptions about the dis-
tribution of (Vj, Uj) for the differences Xj are not entirely obvious however.
Suppose, for example, that for some constant e, all the Vj and Uj - 0 are
independent and identically distributed. Then varying e gives rise to a shift
parameter family of distributions for Xj' but in addition,

Xj - e= [(Uj - e) - Vj]

is the difference of two independent, identically distributed random variables.


By no means every shift family can arise in this way. In particular, Xj must be
symmetrically distributed around O. This is not the only restriction, how-
ever, and several symmetric densities in Table 8.1, for example, including the
uniform and quadratic (beta with r = 2), are excluded. The same restrictions
arise under broader assumptions, as will be discussed further below.
The assumption that all the Vi and Wi - 0 are independently, identically
distributed, although a special case of assumptions to follow, is unrealistic
in the context of matched pairs, because it implies that the pairing accounts
for none of the variability, contrary to what we expect when we pair.
We now consider several more natural assumptions about Vj and Uj and
see what they imply about Xj' In each case, we assume that the J'i and Uj
are independent between pairs, but not necessarily within pairs; that is, the
pairs (VI> Wd, (V2' W2),···, (Vn' w,,) are mutually independent, but VI
mayor may not be independent of WI' V2 of W2 , etc. It follows that the
n differences X I = Wl - Vl' ... , Xn = w,. - v" are mutually indepen-
dent, as we have assumed the Xi to be throughout this chapter.
Suppose first that (Vj, Uj) is bivariate normal with the same distribution
for each j, so that (Vl, Wl), (V2' W2), ... , (v", w,,) are independent and
identically distributed according to the bivariate normal distribution. Then
the differences X J = Uj - Vj are independently, identically, normally dis-
tributed with mean 0 equal to the treatment population mean minus the
control population mean. Tht:( situation then reduces to that of the earlier
sections with normal alternatives, and the results given there apply here.
The relevant variance is of course the variance of the Xj' which is

(9.1)

in standard notation. Alternatives with 0"; = O"~ are usual but not necessary.
Dropping the bivariate normal assumption, now suppose that there is an
"effect" associated with each pair, which may be regarded as either fixed or
396 8 Asymptotic Relative EfficIency

random, and in addition a fixed treatment effect, but no treatment-pair


interaction. More specifically, suppose that
~ = Ilj + Vi, (9.2)

»j = Ilj + 0 + Wi (9.3)
where Il) is the effect associated with the jth pair, 0 is the treatment effect,
and V~ and Wi are" disturbances" or "errors." If the Vi and Wi are mutually
independent and normally distributed with mean 0 and common variance
(12, this is a normal, fixed-effects or mixed model for two-way analysis of
variance with no interaction; the design consists of two treatments (one is
the control) in blocks of size two. If, more generally, the n pairs of errors
(VII' W II ), (V2, W 2), ... , (V~, W~) are independently, identically and sym-
metrically2 distributed, then the random variables
X) = »j - lj = 0 + Wi - Vi (9.4)
are independently, identically, symmetrically distributed about 0 (Problem
63). The one-sample tests based on symmetry apply to the null hypothesis
o= 0, and the alternative 0 =1= 0 is a shift alternative provided, as we assume,
the distribution of (Vi, Wi) is the same for all O.
Specifically, if Vi and Wi have ajoint density, then Xj will have a density
J(x; 0) = h(x - 0) where h is the density of Wi - Vj. Any density h which is
symmetric around zero, like those in Table 8.1, arises from some symmetric
joint density for Vi and Wj (Problem 63d).
If we assume, however, that the two "errors" V~ and Wj are themselves
independently, identically distributed, with density q, say, then h is given
by

h(x) = f q(x + t)q(t) dt = f


q(t)q(t - x) dt. (9.5)

This density is automatically symmetric about zero, but it is sometimes not


trivial to determine whether a given symmetric density has this form or not.
Many symmetric densities do not, including some of those in Table 8.1, as
mentioned earlier (Problem 64). Furthermore, if the density q of the errors
is not arbitrary but is one of those in Table 8.1, then the density h of the dif-
ferences ofthe errors will not be one of those in Table 8.1 except in two special
cases. If q is normal, then h is normal; if q is Cauchy, then h is Cauchy; but
if q is any other distribution of Table 8.1, then h is not a distribution of
Table 8.1. Thus, Table 8.1 fails to apply generally to the present situation in
two respects: first, some of the distributions in Table 8.1 cannot arise at all;
second, if the distribution of the errors is one of those in the table, the relative
efficiencies are generally different from those given in the table.

2 The pair of random variables (V~, W~) IS said to be distributed symmetrically ("permuta-
tlOnally symmetrically" is more specific but ralely used) if(V~. W~) has the same joint distribution
as (W~. V~). ThIS contrasts with the definitIOn of symmetry around IJ for a single random variable.
9 Asymptotic Relative EfficIency of Procedures for Matched Pairs 397

The relative efficiencies in this situation can, of course, be computed.


The efficacy of the mean is much as before (Problem 66a). As it happens,
the efficacy of the median test here is closely related to that of the Wilcoxon
procedure in the situations of previous sections (Problem 66b). Unfor-
tunately, however, the efficacies of the Wilcoxon, squared ranks, and normal
scores procedures do not seem to be obtainable in closed form for most of
the distributions of Table 8.1 in the present situation.
To summarize, for procedures based on the differences Xj = »j - l-j,
the consequences for Xj of any assumption about the paired observations
(~, »j) can be traced, and asymptotic relative efficiencies can be computed
from the distribution of the Xj under these assumptions. However, the dis-
tribution of the Xj does not usually satisfy the same assumptions as that of
J.j and »j, or have a related univariate form. It is true, however, that Xj is
normal under common normal models for (l-j, »j).

9.2 Bounds for Asymptotic Efficiencies for Matched Pairs

The situation here as regards the bounds on asymptotic relative efficiencies


is similar to that of Sect. 8.3. All of the statements there apply equally to
procedures based on the Xj in any situation where the Xj are independent
and identically distributed with density f(x; (}) = h(x - (}). However, some
of the restrictions on h discussed there are less natural here, and some
restrictions which are natural here were not discussed there.
Suppose, in particular, that the model given by (9.2) and (9.3) holds and
that the 2n errors V~ and W~ are independent and identically distributed with
density q. Then the Xj have density f(x; (J) = h(x - (J) where h is given by
(9.5) and is automatically symmetric. As in Sect. 8.3, the asymptotic efficiency
of the normal-theory procedures relative to the median procedures can be
zero, but now its maximum value is 125/72 = 1.74 (Problem 67a), rather than
3. Several other bounds of Sect. 8.3 cannot be improved here: the asymptotic
efficiency of the normal scores procedures relative to the normal-theory
procedures can be infinite and can be as small as one but no smaller; the
asymptotic efficiency of the Wilcoxon procedures relative to the normal-
theory procedures can be infinite; the asymptotic efficiency of either the
Wilcoxon or normal scores procedures relative to the median procedures
can be arbitrarily small; and the asymptotic efficiency of the normal scores
procedures relative to the Wilcoxon procedures must be more than but can
come arbitrarily close to n/6. None of these statements is altered by requiring
q to be symmetric, unimodal, or both. The proofs are requested in Problem
67. For the remaining cases, sharp bounds appear to be unknown in the
present situation. Of course all the bounds given in Sect. 8.3 are still valid
here, but the added restrictions may make better bounds possible.
The foregoing bounds are summarized in Table 9.1. Each cell contains
lower and upper bounds on the asymptotic efficiency of the procedure for
398 8 Asymptotic RelatIve Efficiency

Table 9.1 Bounds on Asymptotic Relative Efficiencies for Shifts


under (9.5)
Normal
Mean Median Wilcoxon scores

Mean 1 1 0 .Lil 0 125.


I 0
'2 T08

Median 72
00 I 1 .1. 00 O· 00
125 3

Wilcoxon .!QJiu
125 00 0 3· 1 I O· .
2-

Normal scores 1 00 0 00" ~ ex, " 1 1

a Bound taken from Table 8.2. A better bound may be possible.

that row relative to the procedure for that column for shifts of a density of the
form (9.5). Except as indicated by a footnote, the bounds are greatest lower
and least upper bounds, whether or not q is symmetric and/or unimodal.

10 Asymptotic Relative Efficiency of Two-Sample


Procedures for Shift Families
The specific results of Sects. 5 and 8 all related to one-sample procedures. In
Sect. 9 we discussed the applicability of these results to matched pairs. We
now turn to two-sample procedures, such as those of Chaps. 5 and 6. As we
shall see, the heuristic discussion and properties of asymptotic relative
efficiency carryover easily to this situation, and Pitman's formula can again
be used to compute asymptotic efficiencies, and the special formulas and
numerical results of Sect. 8 have natural counterparts here.
The development given in Sects. 2-4 applies here to two independent
samples of sizes m and n with essentially only the changes necessitated by
replacing a single sample size n by a pair of sample sizes m, n. Thus a pro-
cedure based on a statistic Tm. n with mean J.lmj8) and variance (J;,,,/,8) will
ha ve an efficacy
(10.1)

which relates in exactly the same way as the one-sample efficacy to the power
of a test or the distribution of an estimator or confidence limit. A specified
performance (power, for instance) can now, however, be achieved by various
combinations of m and n. Both the total sample size N = m + n and the
allocation, say A = miN, are important. It turns out, however, as one might
expect, that the influence of the allocation is the same for all procedures of
10 Asymptotic Relative Efficiency of Two-Sample Procedures for Shift Families 399

the types we have considered. The role of the limiting efficacy per observation
can be played here by
. N
e. = I1m-e", II' (10.2)
mil .

because this limit is the same however m and n approach w. Equivalently,


em. II = (mnl N)e. = N ,1(1 - A)e. except for an error of smaller order. In
particular, if miN approaches a limit A.', the limit in (10.2) does not depend
on A'.
The asymptotic relative efficiency of two procedures is again given by
Pitman's formula

E l' 2 -_ -e!. -_ I'1m -


e 1 • m• n _ I' [jl'l. m. n(O)] 2
- - 1m 2
O"tm.n(O)
2' ( l{U)
. e2. e2.m.n O"l.m.,,(O) [jl~.m.,,(O)]
In comparing the performance of two procedures, the comparison now is
between a procedure Q 1 based on two mutually independent samples of
sizes m 1 and n 1 , and another procedure Q 2 based on samples of sizes m2
and n2' To exemplify the use of Pitman's formula and the meaning of asymp-
totic relative efficiency here, suppose that we want to compare two sequences
of test procedures Tl and T2 by considering the sample sizes m2 and n2
required by test T2 to achieve the same power against the same alternative
as the test Tl based on ml and nl observations. Let m2 and n2 be the same
multiples ofml and n 1 , that is, m2/ml = n2/nl = E, say, except for rounding.
(Equivalently, E = N 21N 1 and the allocations m21N 2 and m 11N 1 are equal.)
Then the multiple E required approaches the asymptotic relative efficiency
E 1: 2 as the sample sizes approach infinity and the alternative approaches the
null hypothesis. This limit E l' 2 is independent of the level and power specified
and the allocation between samples, as long as t~e allocation is the same for
both procedures.
The definitions of asymptotic relative efficiency in Sect. 6 can all be
adapted to the two-sample case in a similar manner, and comments like
those of Sect. 6 still apply. Pitman's formula in its two-sample form (10.3)
can be used to compute asymptotic relative efficiencies of two-sample
procedures for families of alternative distributions indexed by a single
parameter 0, and the same numerical values hold for corresponding tests,
point estimators, and confidence procedures.
For two-sample shift alternatives, specifically, let us write the respective
population densities in the form

f(x; 0) = hex) and g(x; 0) = hex - 0), (10.4)

so that the null hypothesis 0 = 0 corresponds to the hypothesis of identical


distributions. Then we can calculate the value of the efficacy of two-sample
procedures for shift families exactly as we did in Sect. 8 for the one-sample
procedures.
400 8 AsymptotIc RelatIve Efficiency

Since the limit (10.2) is the same however m and n approach infinity,
that is, whatever the allocation miN, we need not carry out additional com-
putations for the specific procedures discussed in Chaps. 5 and 6 and the
distributions h considered in Sect. 8, because each of the two-sample tests
has a corresponding one-sample limit for which we have already computed
the efficacy. A two-sample test for identical but unspecified populations, based
on samples of sizes m and n, reduces as m ~ 00 with n fixed to a one-sample
test of a hypothesis about the distribution of the Y sample, which is of
finite size n. In particular, the two-sample t test for equality of means ap-
proaches the standard t test for the hypothesis that the Y mean equals a
given value, namely the X population mean as determined from the infinite
X sample; the two-sample median test for identical populations reduces to
the sign test of the hypothesis that the probability is ! that a Y observation
exceeds a given value, namely the X population median (see Problem 27,
Chap. 5); and Problem 69 shows that the asymptotic properties of the two-
sample Wilcoxon rank-sum statistic for identical populations are the same
as the asymptotic properties of the one-sample Wilcoxon rank-sum statistic
for the null hypothesis of symmetry about a given point. Similar reductions
hold for the two-sample normal scores test and the randomization test based
on the difference of the sample means. Accordingly, the results for efficacy
given in Table 8.1 and for bounds on asymptotic efficiency given in Table 8.2
apply to the corresponding two-sample tests here.
Unfortunately, there seems to be no easy way to see that the limit (10.2)
does not depend on how m and n approach infinity except by calculation
in special cases or appeal to powerful theorems. Problem 70a requests such
calculation for the two-sample normal-theory, median, and Wilcoxon pro-
cedures. For a procedure based on a sum of scores Cmnk (Sect. 5, Chap. 5),
the efficacy for the shift alternative (10.4) is (Problem 71a)

(10.5)
If the scores are of the form

Cmnk = bmnc (k-t) + .


~ remamder, (10.6)

then under suitable regularity conditions the limit (10.2) exists and is (Prob-

rI{f
lem 7tb)

{f [f T}
rI{f
e. = c(u)h,[H- 1(u)]lh[H- 1(u)] du C2 (U) du - c(u) du

= {fOOoo c[H(x)]h'(x) dx c2 (u) du - [f c(u) dU]} (10.7)


II Asymptotic Efficiency of Kolmogorov-Smirnov Procedures 401

If h is symmetric and c(1 - u) = -c(u) for all u, this reduces to (8.8) with
c(u) and c[H(x)] in place of c(2u - 1) and c[2H(x) - 1]. Thus the one-
sample and two-sample asymptotic efficacies are the same for shifts of
symmetric densities h if the one-sample function c(1)(u) and the two-sample
function c(2)(u) satisfy
C(2)(U) = -c(2)(1 - u) = c(1pu - 1) for 1- < u < 1. (10.8)
These conditions on c(2) and C(I) are actually quite natural. For instance,
the one- and two-sample normal scores procedures have
c(1,(u) = <1>-1[(1 + u)/2] and c(2)(u) = <I>-I(U),
which satisfy (10.8). The one- and two-sample median and Wilcoxon pro-
cedures also satisfy (10.8), and a two-sample sum-of-scores procedure cor-
responding similarly to the one-sample squared rank procedure of Sect. 8
has scores Cmnk = ±(k - N /2)2, one sign applying for k > N /2 and the
other for k < N /2 (Problem 72).
Thus it can be shown that the efficiencies of appropriately corresponding
one- and two-sample procedures are the same for symmetric densities h, and
that Tables 8.1 and 8.2 apply to two-sample procedures as well as to one-
sample procedures. For asymmetric densities, where the one-sample pro-
cedures are generally not valid, the two-sample procedures are valid and
could also be compared using the formulas above. For the normal theory
procedures we still have e. = 1/([2 and for the median procedures e. =
4h2(~o 5) where ~O.5 is the median of h, but for the Wilcoxon procedure we

r.
must use

e. = 12[Ih(x)h( -x) dx (10.9)

Numerical results will not be given here.


The two-sample tests we have discussed become invalid, in general, if the
population variances are unequal, even if the populations are symmetric
about the same point. Their sensitivity to this departure from assumptions
can be studied asymptotically by methods similar to those discussed here.
See Problems 73-78 and Pratt (1964).

11 Asymptotic Efficiency of Kolmogorov-Smirnov


Procedures
The asymptotic behavior of the Kolmogorov-Smirnov statistics is funda-
mentally different from that of the other test statistics we have considered,
and hence these statistics require separate discussion. We have already seen,
in Sect. 3.2 of Chap. 7, that their asymptotic distributions have a different
402 8 Asymptotic Relative Efficiency

shape under the null hypothesis. This is also true under alternative hypo-
theses, as we shall see below. Accordingly, the asymptotic power function of
a Kolmogorov-Smirnov test has a different shape from that of the other
tests, and its dependence on the level is also different. The sample size
required to achieve a given level and a given power at a given alternative
therefore depends on the level. power. and alternative in a fundamentally
different way. Consequently, the asymptotic efficiency of the Kolmogorov-
Smirnov procedures relative to other procedures depends on the Type I and
Type II errors, (X and p, and therefore will have a much more restrictive
meaning in this section than in the rest of this chapter.
J J
Since the asymptotic distributions of mn/N Dmn and mn/N D:'nare inde-
pendent of the way m and n approach infinity under both the null hypothesis
and nearby alternatives, it is true here as in the previous section that the asymp-
totic efficiency of the Kolmogorov-Smirnov procedures relative to other
procedures with the same allocation miN does not depend on what that
allocation is. Therefore, two-sample efficiencies are again the same as one-
sample efficiencies. We shall treat the one-sample case here since the notation
is simpler, but the results are applicable to the two-sample case once
In is replaced by Jmn/N. One sided and two-sided tests are not simply
related however. We shall discuss one-sided tests first.

11.1 One-Sided Tests

An effective method of computing the asymptotic power of the Kolmogorov-


Smirnov procedures is not known in general. It is easily computed for a
uniform shift alternative, however. If a null distribution which is uniform
with range 1 is shifted by an amount e,
or more generally is shifted by 0
times its range, then the power of the one-sided Kolmogorov-Smirnov test
is approximately (Problem 79a; Quade, 1965, Theorem 4.1)

{e-2(C.-~IiiW if Jne
r.:.
< Ca.
(1Ll)
I if V ne ~ Ca ,
where Ca is the upper (X point of the asymptotic null distribution of JnD:
given in (6.3) of Chap. 7 as

Ca = J - (loge (X)/2. (11.2)


The median test turns out to be a particularly appropriate comparison pro-
cedure here, as we shall see. Its power against the same alternative is ap-
proximately
(11.3)
where Za is the upper (X point of the standard normal distribution, as follows
(Problem 79b) from (2.11) and (8.3). The ratio of sample sizes needed to
II Asymptotic EfficIency of Kolmogorov-Smlrnov Procedures 403

achieve level lI. and power 1 - 13 at the same alternative e therefore ap-
proaches (Problem 79c)
(11.4)
as e -+ 0 and the sample sizes approach infinity. The dependence of the
asymptotic efficiency on lI. and 13 is immediately evident. Some numerical
values are given in Table ILl, which is explained further below. The efficiency
approaches 1 as lI. -+ 0 for fixed 13, it approaches 00 as 13 -+ 0 for fixed lI., and
it approaches (Problem 79d)

[2l1.C a/<P(zcx)]2 = 41tll.2 (lOg ~) eZ~ (11.5)

as 13 -+ 1 - lI., that is, when the alternative approaches the null hypothesis
faster than l/Jn -+ O.
Equation (11.4) gives the asymptotic efficiency of the one-sided, one-
or two-sample Kolmogorov-Smirnov test relative to the one- or two-sample
median test, for uniform shift alternatives. Its asymptotic efficiency relative to
any other test can be obtained by dividing (11.4) by the asymptotic efficiency
of the other test relative to the median test for the same alternatives, for
instance, if the other test appears in Table 8.1, by an entry in the line labeled
uniform.
For other alternatives, the dependence on lI. and 13 will generally differ
from that in (11.4). While no simple, general method of computation is
known, it is possible to obtain bounds which provide some insight.
First, for any symmetric, unimodal shift alternative, (1Ll) is an asymp-
totic upper bound on the power of the Kolmogorov-Smirnov test (Problem
7ge), where e is now the difference between the true and hypothesized c.dJ.
at the median. (In the case of a uniform shift, this agrees with the previous
definition of e, and the uniform is the "most favorable" symmetric uni-
modal shift.) Therefore (11.4) is an upper bound on the asymptotic efficiency
of the Kolmogorov-Smirnov test relative to the median test. If (11.4) is
divided by the asymptotic efficiency of any other test relative to the median
test for any particular symmetric, unimodal shift family, we obtain an upper
bound on the asymptotic efficiency of the Kolmogorov-Smirnov test
relative to the other test for this family.
Second, it is easy to obtain a lower bound (unfortunately very weak) on
the power by observing that the Kolmogorov-Smirnov test will certainly
reject Ho if Fn(ll) - F(Il) exceeds the Kolmogorov-Smirnov critical value
at the median Il, or equivalently, if the median test rejects even when the
Kolmogorov-Smirnov critical value is used in place of the (smaller) median
test critical value for Fn(ll) - F(Il). Approximating the relevant binomial
probability by a normal probability in the usual way, a lower bound on the
approximate power is obtained (Problem 80a) as

(11.6)
404 8 Asymptotic Relative Efficiency

The power of the median test with its own critical value is again approximated
by (11.3), and it follows (Problem 80b) that
(11.7)
Lower bounds relative to tests other than the median test are obtained by
division, as the upper bounds in the previous paragraph were.
Unfortunately, for alternatives very near the null hypothesis, (11.6)
does not even say that the power exceeds (x, and (11.7) is correspondingly
poor. Furthermore, the right-hand side of (11.7) is always less than 1, while
we could hope to be able to prove that the Kolmogorov-Smirnov test is
asymptotically more efficient than the median test for at least some alterna-
tives other than the uniform shift. It is worth noting, however, that the
right-hand side of (11.7) approaches 1 as (X -+ 0 [Capon, 1965] or f3 -+ 0, so
that the Kolmogorov-Smirnov test is asymptotically almost as efficient
as the median test if (X or f3 is small enough. Unfortunately the approach is
very slow, as is evident from Table 11.2.
A lower bound which improves on the previous one for any alternative
G ~ F can be obtained as follows.
Pa{sup[Fn(t) - F(t)] ~ c} = 1 - PG{sup[Fn(t) - G(t) + G(t) - F(t)] ~ c}
2 I - Pa{sup[FII(t) - G(t)] ~ c and FII(p.) - G(p.) ~ c - a},
( 11.8)
where f} = G(p.) - F(p.) as before. This last probability, conditional on Fn(p.),
can be evaluated asymptotically by arguments like those leading to the
asymptotic null distribution, and the expectation of the result over the dis-
tribution of Fip.) can be found (see below). The resulting lower bound on
the asymptotic power is (Problem 82a; Quade, [1965, Theorem 4.2(d) with
T = i])

<I>[2(JnO - cll ) ] + 2(X<I>( - 2Jnf}) - <1>[- 2(JnO +C ll ) ] , (11.9)


which in turn gives the lower bound on asymptotic efficiency as (Problem
82b)
(11.10)
where d is the solution of
f3 = <1>( -d + 2c ll ) - 2(X<I>( -d)+ <1>( -d - 2c ll ). (11.11)
While (11.11) cannot be solved explicitly, a table of f3 as a
function of d
can be generated directly, and a table of d as a function of [J can be obtained
by inverse interpolation. Some values resulting for the lower bound (11.10)
are given in Table 11.1. Lower bounds relative to tests other than the median
test are obtained as before. When (X -+ 0 or f3 -+ 0, this lower bound also
approaches 1 since the first term of (11.9) dominates and (11.10) behaves like
(11.7). As f3 -+ 1 - (x, (11.9) approaches (X as it should (unlike (11.6» but the
approach is so rapid that the right-hand side of (11.10) approaches 0 (Problem
II Asymptotlc EfficIency of Kolmogorov-Smlrnov Procedures 405

Table 11.1 Bounds on Asymptotic Efficiency of the One-Sided Kolmogorov-


Smirnov Test Relative to the Median Test
Power I - /1: * 0.1 0.3 0.5 0.7 0.9 0.99 0.999 0.9999 1.0
Type II Error /1: * 0.9 0.7 0.5 0.3 0.1 om 0.001 0.0001 0
ex

0.9 5.54 10.85 19.06 30.09 w


0.44 0.53 0.57 0.61
0 0.17 0.28 0.35

0.5 2.18 2.48 3.18 5.05 7.44 10.24 w


0.59 0.63 0.67 0.72 0.74 0.76 I
0 0.12 0.28 0.44 0.53 0.58 1

0.1 1.50 1.62 1.75 1.93 2.31 3.24 4.33 5.51 w


0.69 0.74 0.76 0.78 0.80 0.82 0.84 0.85 I
0 0.23 0.36 0.46 0.56 0.65 0.70 0.73

0.05 1.41 1.45 1.56 1.68 1.83 2.17 2.97 3.88 4.86 w
0.71 0.74 0.77 0.79 0.80 0.82 0.84 0.86 0.87
0 0.11 0.34 0.45 0.53 0.62 0.69 0.73 0.76

0.025 1.35 1.42 1.52 1.62 1.76 2.06 2.77 3.57 4.42 w
0.73 0.77 0.79 0.81 0.82 0.84 0.86 0.87 0.88
0 0.23 0.43 0.52 0.59 0.66 0.72 0.76 0.78

om 1.30 1.38 1.48 1.57 1.69 1.96 2.59 3.28 4.01 w


0.75 0.79 0.82 0.83 0.84 0.86 0.87 0.88 0.89 1
0 0.36 0.52 0.59 0.64 0.70 0.75 0.78 0.80 1

0.005 1.28 1.36 1.45 1.54 1.65 1.90 2.48 3.12 3.78 w
0.76 0.81 0.83 0.84 0.85 0.87 0.88 0.89 0.90 1
0 0.43 0.56 0.63 0.67 0.72 0.77 0.80 0.81 1

0.001 1.22 1.33 1.40 1.48 1.58 1.80 2.30 2.83 3.38 w
0.78 0.84 0.86 0.87 0.88 0.89 0.90 0.90 0.91
0 0.55 0.65 0.69 0.73 0.77 0.80 0.82 0.84

0.0001 1.20 1.29 1.36 1.43 1.52 1.70 2.12 2.57 3.03 w
0.81 0.87 0.88 0.89 0.89 0.90 0.91 0.92 0.92 I
0 0.66 0.72 0.75 0.78 0.81 0.83 0.85 0.86

Note: For each IX and P pair, the entry in the first row is the value for umform shifts, (11.4),
which IS also an upper bound for alternatives G most distant from F at the median; the entry
in the second row is the value for Laplace shifts, (11.19), which is also a lower bound for shifts
of a density h satisfying hex) ~2h(lI)min{H(x). 1 - H(x)} where illS the median; and the entry
in the thud row IS the lower bound (11.10) for stochastically one-sided alternatIves All bounds
are valid for symmetric unimodal shifts. Column * applies as f3 -+ I - IX
406 8 AsymptotIc RelatIve Efficiency

82c). The improvement of (11.10) over (11.7) is revealed numerically by com-


paring the results given in Tables 11.1 and 11.2; it is very small.
Symmetric, unimodal shift alternatives which come arbitrarily close to
achieving equality in the bounds given in (11.8)-(11.11) are described in
Problem 83. We have thus found the maximum and minimum asymptotic
efficiency of the one-sided Kolmogorov-Smirnov test relative to the median
test for such alternatives (Equations (11.4) and (11.10) and Table 11.1).
Bounds on asymptotic efficiency relative to other tests could be obtained by
dividing by the minimum and maximum asymptotic efficiency of the other
tests relative to the median test, but these bounds are probably very poor
because different kinds of alternatives are "least favorable" for the two
factors.
* Better bounds on the asymptotic power and relative efficiency of the
Kolmogorov-Smirnov test for specific alternatives other than those achiev-
ing equality in (11.1) and (11.4) or (11.9) and (11.10) can be obtained as
follows. Let J11' J12' ... ' J1r be any r quantiles of G, with J11 < J12 < ... < J1"
J10 = - 00 and J1r+ 1 = 00, and suppose that
G(x) - F(x) ~ J, for J1,-l < x < J1i> i = 1,2, ... , r +1 (11.12a)
G(J1,) - F(J1j) ~ e, for i = 1, 2, ... , r (l1.12b)
with e, ~ max(J" J j + I). (This bounds G - F from below by a step function.
Spikes are allowed at the steps, but presumably one would take ej = (jj or
J,+! except in peculiar cases. Then
P= PG{sup[Fn(x) - F(x)] ::::;; c}
= Pa{sup[F,,(x) - G(x) + G(x) - F(x)] ::::;; c}
::::;; PG[Fn(x) - G(x) ::::;; c - J j for J1j-1 < x < J1i> i = I, ... , r + 1,
and Fn(J1j) - G(J1j) ::::;; c - ej, i = 1, ... , r]. (11.13)

Conditional on F"(J1,) - G(J1,) = u" i = 1, ... ,r, the r + 1 events in the


first line of (11.13) are independent and their probabilities can be evaluated
asymptotically by the reflection principle and a limiting argument as in
Sect. 3, Chap. 7. For Uj ::::;; C - ej, we have Uj_ 1 ::::;; C - J j and Uj ::::;; C - Ji
and the ith conditional probability is

PG[Fn(x) - G(x) ::::;; c - J j for J.lj-I < x < J1;iU" U2,···' ur]
i = 1, 2, ... , r + 1,
(11.14)
where Vi = G(J1j) - G(J1'-I) and Uo = Ur+1 = o. The asymptotic value of
the right-hand side of (11.13) is

i e-E1 ... ie-Er r+n {I -


1
exp[ -(2n/v;)(c - Ji - Ui-l)(C - Ji - u;)]}
o 0 i= 1

g(u!, ... , llr) du!, ... , dll r (11.15)


II Asymptotic Efficiency of Kolmogorov--Smlrnov Procedures 407

Table 11.2 Weak Lower Bound (11.7) on Asymptotic Efficiency of One-Sided


Kolmogorov-Smirnov Test Relative to Median Test
Power 1 - p: * 0.1 0.3 0.5 0.7 0.9 0.99 0.999 0.9999 1.0
Type II Error p: * 0.9 0.7 0.5 0.3 0.1 0.01 0.001 0.0001 0
oc

0.9 0 0.14 0.26 0.34


0.5 0 0.10 0.27 0.44 0.52 0.58
0.1 0 0.22 0.36 0.46 0.56 0.65 0.70 0.73
0.05 0 0.10 0.34 0.45 0.53 0.62 0.69 0.73 0.76
0.Q25 0 0.22 0.43 0.52 0.59 0.66 0.72 0.76 0.78
0.01 0 0.36 0.52 0.59 0.64 0.70 0.75 0.78 0.80
0.005 0 0.43 0.56 0.63 0.67 0.72 0.77 0.80 0.81
0.001 0 0.55 0.65 0.69 0.73 0.77 0.80 0.82 0.84
0.0001 0 0.66 0.72 0.75 0.78 0.81 0.84 0.85 0.86
0 1 1 1 1 1 1

Column • applies as Ii -+ I - IX (Bound approaches 0 )

where 9 is the asymptotic joint density of U 1, .•. , Ur • The product in (11.15)


has 2r + 1 terms in general, but each term is the exponential of a quadratic
(or lower) polynomial in U 1, ••. , Un and 9 is of the same form since U 1, ... , Ur
are asymptotically multivariate normal. Therefore, each term to be integrated
is the exponential of a quadratic form, and completing the square leads to an
expression for its integral as a multiple of an r-variate normal c.dJ. Hence
(11.15) can be expressed as a linear combination of r-variate normal c.dJ.'s,
with 2r + 1 different terms in general. Thus we have obtained an upper bound
on f3 and hence a lower bound on the asymptotic power and relative efficiency
of the one-sided Kolmogorov-Smirnov tests. Upper bounds could be
obtained similarly by bounding G - F from above instead of from below in
(11.12).
The bound in (11.9) was obtained in this way with r = 1, !5 1 = !5 2 = 0,
f. = O. For r = 2, the computation is not difficult since the bivariate normal
c.dJ. can be computed quite easily and has even been tabled. Since the alge-
braic expressions are lengthy and unrevealing and the calculations have not
been carried out, further details will not be given here. For r ~ 3, the com-
putation is more difficult because of the difficulty of computing the multi-
variate normal c.d.f., but certainly feasible for small r.
The following further improvement leads to bounds of the same form,
without any additional complication of the algebra or calculation. However,
somewhat more analytical grasp on G - F is required, because a piecewise
linear bound is used instead of a step function. Specifically, we replace the
right-hand side of (11.12a) by a linear function of G with values Yi and !5 i
at the endpoints JJ..- 1 and JJ.i of the x-interval in question; that is, suppose
that
G(x) - F(x) ~ Y.[JJ.i - G(X)]/Vi + !5 i[G(x) - JJ.i-l]/Vi
for JJ..-l < x < JJ.j, i = 1,2, ... , r + 1 (11.16)
408 8 Asymptotic Relative Efficiency

and that (11.12b) still holds. Then (11.13) changes correspondingly, but
the only effect in (11.14) and (11.15) is to replace the first c5 i by Yi. (This
follows because, given Ui-I and Ui and a linear boundary under the limiting
distribution of F'r> only the distances from Ui-I and u. to the boundary
matter.) The nature of the integrand in (11.15) as a function of UI> ••• , Ur is
thus unchanged.
For r = 1, 15 1 = 61 = Y2 = 0, YI = 15 2 = 0, and J-tl = J-t (the median),
the condition in (11.16) becomes
20G(X) for x < J-t
G(x) - F(x) ~ { 20[1 _ G(x)] for x > J-t. (11.17)

Under thii condition the asymptotic power is at lellst


<I>[2(JnO - ca)] + 2a<l>( - 2JnO)e4 ,foOc, - <1>[ - 2(JnO + ca)]e 8 ,foOc.
(11.18)
and the improved lower bound on asymptotic efficiency is
EKS : 2 ~ (za + zp)2/d 2 (11.19)
where d is the solution of

Some values for the lower bound in this case are given in Table 11.1. When
a -+ 0 or f3 -+ 0, the lower bound (11.19) again approaches 1, but as f3 -+ 1 - a,
it approaches (Problem 84)

{2c a[a - 2<1>(-2ca)]/cI>(za)}2 = 4n[a - 2<1>(-2Ca)]2(IOg~)ez~ (11.21)

rather than O. Thus, this lower bound is effective in the neighborhood of the
null hypothesis, while (11.7) and (11.1 0) were not.
For a symmetric shift family with density h(x - J-t), as J-t -+ 0, the con-
dition (11.17) under which the foregoing bound applies becomes

h(x) ~ 2h(0)H(x) for x < O. (11.22)

A sufficient condition for this to hold is that

h(x)/H(x) is monotonically decreasing for x < 0, (11.23)

or, in terms of the upper tail, that the hazard function h(x)/[1 - H(x)] be
monotonically increasing for x > 0 (increasing failure rate). A sufficient
condition for this, in turn, is that
h'(x)/h(x) is monotonically decreasing. (11.24)
Problem 85 gives further conditions equivalent to these, which help clarify
their relationship.
II Asymptotic Efficiency of Kolmogorov-Smlrnov Procedures 409

Condition (11.24) is satisfied, and hence so are (11.23) and (11.22), and the
bound in (11.19) applies, for the normal, logistic, Laplace, and uniform
distributions, and more generally for the double exponential power dis-
tribution (5.6) with k ~ 1 and the symmetric beta distribution with r ~ 1.
Conditions (11.22) and (11.23) also hold, although (11.24) does not, for the
symmetric beta with r < 1 and the Cauchy distributions. For the double
exponential power distribution with k < 1 (high-tailed distributions),
however, not even condition (11.22) holds, and the bound in (11.19) has not
been proved. Derivation of these results is requested in Problem 86.
For the Laplace distribution, equality holds in (11.19), which therefore
gives the actual minimum, not just a lower bound, under anyone of the con-
ditions (11.17), (11.22), (11.23), or (11.24) (Problem 87).
For shifts of a density h, as (3 --+ 1 - a, that is, in the asymptotic neigh-
borhood of the null hypothesis, the quantity

f
r,
2
{[2aC a/¢(Za)] {2¢[T(u)c a] - l}h,[H- 1 (u)]/h[H- 1(u)] dU}

= {[4aC al4>(Za)] f ¢[T(u)ca]h[H- 1(u)]T'(u) du (11.25)

where
T(u) = (2u - 1)/[u(1 - u)r /2 , (11.26)
plays the role of the asymptotic efficacy of the one-sided Kolmogorov-
Smirnov test. In other words, (11.25) can be divided by the limiting efficacy
per observation of another test to obtain the asymptotic efficiency of the
Kolmogorov-Smirnov test relative to the other test when {3 is very close to
1 - a. Equation (11.25) may be compared to (8.8) for tests based on sums of
signed constants or scores. Unfortunately, it is typically difficult to evaluate,
as well as being applicable only as {3 --+ 1 - a. For proof, see Hajek and
Sidak [1967]. *

11.2 Two-Sided Tests

We turn now to two-sided tests. One might expect the results for a one-
sided test at level a/2 to apply to the two-sided test at level a, and some of
them do approximately in some situations, as we shall see. They do not
apply directly or exactly, however, and may not apply even approximately,
because they do not take account of either rejection in the" wrong" direction
or rejection in both directions at once.
The possibility of rejection in both directions at once implies that the
rejection region for a two-sided Kolmogorov-Smirnov test is the union of two
one-sided regions which are neither mutually exclusive nor each at level
410 8 Asymptottc RelatIve EfficIency

exactly rx/2. As far as the null hypothesis is concerned, however, as ob-


served in Sect. 4.2 of Chap. 7, the numerical difference between the two-
sided P-value and twice the one-sided P-value is negligible whenever the
latter is no larger than 0.1. Thus, at usual levels rx, the error in the level of a
two-sided test will be negligible if we use the one-sided regions for level
rx/2. Specifically, the error is the probability under the null hypothesis of the
intersection of the two one-sided regions, that is, the probability of rejection
in both directions at once, and this is negligible for rx ::; 0.2.
Under alternatives, however, this probability may not be negligible. For
instance, it is far from negligible under changes of scale. Even if it were
negligible, there would remain the problem that the contribution to power
from rejection in the wrong direction may be different for the Kolmogorov-
Smirnov test than for other tests, especially as its power function has a
different shape anyway. For some alternatives (such as scale) against which
the Kolmogorov-Smirnov test might be used, it is not even clear which
direction is "wrong."
There is one set of assumptions which leads to a simple result. Suppose that
it is clear which region is "wrong" for rejection, as it is in the situations we
have been concerned with, and in particular for shift alternatives. Suppose
also that we adopt the three-conclusion interpretation of two-sided tests
(see Sect. 4.6, Chap. 1) and interpret rejection in both directions at once as a
correct decision. Then the probability of a correct decision is just the
probability of rejection in the right direction, by the corresponding one-
sided test, for the Kolmogorov-Smirnov test as well as for the other tests
we have been considering. This makes the earlier results for one-sided tests
at level α/2 numerically applicable with negligible error to two-sided tests
at level α, with the same β, provided that α ≤ 0.2. This applies to both the
bounds and the maximum and minimum efficiencies. Furthermore, one
could obtain theoretically and numerically exact results, for all α, by sub-
stituting the exact two-sided critical value rather than c_{α/2} for c_α, while still
substituting α/2 for α and z_{α/2} for z_α, but leaving those quantities involving
β unchanged.
Things are unfortunately less simple under the two-conclusion inter-
pretation of tests, which we consider next. Rejection in the "wrong" direc-
tion is now a correct decision under the alternative hypothesis, and must be
taken into account. Near the null hypothesis, it represents an appreciable
fraction of the power, for both the Kolmogorov-Smirnov tests and asymp-
totically normal tests. Far from the null hypothesis, it represents a negligible
fraction of the power for asymptotically normal tests, but the situation is
more complicated for the Kolmogorov-Smirnov tests. Of course, since
ignoring rejection in the "wrong" direction reduces power, the lower bounds
on the power of the one-sided test at level α/2 still apply. They may be weak,
however. Furthermore, the upper bounds on power are invalid; both upper
and lower bounds on relative efficiency are also invalid since they require
upper bounds on the power of one of the tests being compared. The theory
of maximum and minimum relative efficiencies is affected not only by this
but also by the fact that the "least favorable" alternatives in the one-sided
case are not or may not be least favorable in the two-sided case.
For a normal-theory test, numerical investigation indicates that ignoring
rejection in the wrong direction understates power little enough so that the
sample size required to achieve a given β is overstated by no more than 2%
provided that α ≤ 0.1 and β ≤ 0.74, or α ≤ 0.05 and β ≤ 0.85, etc. (Problem
88). In this region, therefore, the lower bounds on the efficiency of the one-
sided Kolmogorov-Smirnov test at level α/2 relative to an asymptotically
normal test apply to the two-sided test at level α after multiplication by
0.98. However, such lower bounds, whether based on the exact normal power
or an upper bound, are not tight and may be appreciably lower than neces-
sary because they ignore the probability of rejection in the "wrong" direction
by the Kolmogorov-Smirnov test; the nature of the tests suggests that
this probability may be larger than for asymptotically normal tests.
An upper bound on the power or efficiency of the Kolmogorov-Smirnov
test requires an upper bound on its probability of rejection in the "wrong"
direction but not also the "right" direction. As long as the alternative is
one-sided in the sense G(x) ≥ F(x) for all x, the probability of rejection
in the "wrong" but not the "right" direction does not exceed α/2 (Problem
89a). Although this bound seems rather crude, without further assumptions
it cannot be improved asymptotically, even for shift alternatives, except to
account for rejection in both directions at once. Accounting for this, we have
[Quade, 1965, Theorem 4.3(b)], for all G such that F(x) ≤ G(x) ≤ F(x) + θ
for all x,

    1 − β = P_G[sup(F_n − F) ≥ c or inf(F_n − F) ≤ −c]

          ≤ P_G[sup(F_n − G) ≥ c − θ or inf(F_n − G) ≤ −c].          (11.27)

There are alternatives, including shift alternatives, which give results
arbitrarily close to equality asymptotically (Problem 89b). The upper
bound can be evaluated asymptotically (Problem 90), and a solution for θ
given β can then be found by inverse interpolation or search. The same is
true for the normal power. Thus it would be feasible to calculate a tight
upper bound on E_{KS:2} for alternatives G such that

    F(x) ≤ G(x) ≤ F(x) + θ for all x and G(μ) = F(μ) + θ at the median μ.
                                                                  (11.28)

We have not done so, however, because the calculation is rather complicated
and the distributions coming close to equality are multimodal in a patho-
logical way. When additional conditions on G are imposed, it is not known
what form the most favorable distributions or maximum power or efficiency
have, and they are probably of a form leading to even more difficult
calculation.
A simple upper bound is obtained, however, by bounding the probability
of rejection in the "wrong" direction by α/2 for the Kolmogorov-Smirnov
test and ignoring it for the median test. Thus, under (11.28) the power of the
Kolmogorov-Smirnov test satisfies (11.27) and hence
    1 − β ≤ P_G[sup(F_n − G) ≥ c − θ] + α/2 ≈ e^{−2n(c−θ)²} + α/2,          (11.29)
where c is the critical value of the two-tailed test. Ignoring the probability of
rejection in the "wrong" direction gives Φ(2√n θ − z_{α/2}) as a lower bound
on the asymptotic power of the median test (compare (11.3)). Therefore, under
(11.28),
    E_{KS:2} ≤ (z_{α/2} + z_β)²/(2c − 2c_{1−β−α/2})²,          (11.30)
where the subscripted c is defined by (11.2) and the unsubscripted c is the
upper α point of the asymptotic distribution of the two-sided Kolmogorov-
Smirnov statistic and hence is negligibly different from c_{α/2} for α ≤ 0.1,
as mentioned earlier.
*The general methods described after (11.2) for bounding the asymptotic
power and relative efficiency of the one-sided Kolmogorov-Smirnov
tests under specific alternatives can be extended to the two-sided tests as
follows. The difference G − F must be bounded above and below by a step
function or a piecewise linear function of G. Then on the left-hand side
of (11.14), F_n − G is bounded above and below by constants or linear func-
tions of G, and the right-hand side has an infinite series of exponential terms.
Thus the integrand in (11.15) becomes a product of infinite series of the
same type, and the bound on β is, therefore, an infinite series of r-variate
normal c.d.f.'s. Unless the region being used to bound the acceptance region
is very small, that is, unless either α is very large or the bound is very poor,
the series converges rapidly and calculation would be feasible for r = 2, as
in the one-sided case.
The two-sided problem presents one additional difficulty not present
in the one-sided case, namely that neither the general method just described
nor any of the similar methods yields effective lower bounds on the power or
efficiency of the Kolmogorov-Smirnov two-sided test near the null hypo-
thesis, that is, as β → 1 − α. The reason is that, as θ → 0, the true power is,
to first order, α plus a term of order θ², while the bound is less than the exact
value by a term of order θ. Accordingly, for sufficiently small θ, the bound
is actually less than α. While this may be unimportant from the point of
view of power, the resulting lower bound on the efficiency is unfortunately
0 for small θ. To avoid this, more refined methods seem necessary.*

PROBLEMS

1. Develop an expression similar to (2.11) for the approximate power of lower-tailed
   and two-tailed tests and show that the interpretations of E_{1:2} in terms of the scale
   factor and relative sample sizes continue to apply.

2. Let μ_n(θ) and σ_n²(θ) be the exact mean and variance of a statistic T_n for some one-
   parameter family of densities f(x_n; θ). Derive the Cramér-Rao inequality [μ_n′(θ)]² ≤
   σ_n²(θ)I_n(θ), where I_n(θ) = −E[(∂²/∂θ²) log f(X_n; θ)] is the Fisher information, by
   showing that μ_n′(θ) = cov(T_n, U_n) and I_n(θ) = var(U_n), where U_n = (∂/∂θ) log f(X_n; θ).
   Conclude that the efficacy satisfies e_n(θ) ≤ I_n(θ). Note that, if T_n is an estimator of θ,
   then μ_n′(θ) = 1 + b_n′(θ), where b_n(θ) is the bias of T_n. See also Problem 43.

*3. Relate the asymptotic efficiency of maximum likelihood estimators to the develop-
ment in this chapter.

4. Show that, for a sample of independently, identically distributed observations with
   mean θ and variance σ², the t statistic for H₀: θ = θ₀ is approximately normal with
   mean (θ − θ₀)/σ and variance 1/n if n is large and θ is close to θ₀.

*5. Trace the development of Sect. 2 and give the leading error terms in (2.11) and the
   approximations leading to it when T_n is the sample mean, the population is normal
   with variance σ²(θ) depending on the mean, and the null hypothesis is given by
   θ₀, σ²(θ₀).

*6. (a) Derive the order of magnitude of the error introduced in (2.11) and the approxi-
mations leading to it when a consistent estimate is substituted for a nuisance
parameter as described in Sect. 2.2.
(b) Is the order of magnitude of this error changed if the test is adjusted to make its
level exact under some null hypothesis?

7. (a) Derive the asymptotic efficiency (3.3) of the sample proportion negative
      relative to the normal-theory estimator of p, the proportion negative in the
      population, for a normal population.
   (b) Show that this efficiency depends only on p.
   (c) Show that this efficiency is the same at 1 − p as at p.
   (d) Show that this efficiency is 2/π when p = 0.5.
   (e) Evaluate this efficiency at p = 0.01, 0.05, 0.10, 0.25, and 0.50.
   (f) Which value of p gives the largest efficiency? (See also Problem 38(c) of Chap. 2.)

8. Show that the efficacies of equivalent test statistics, as defined by (2.10), must be
   asymptotically the same, but need not be identical in finite samples, even if μ_n(θ) is
   defined as the exact finite-sample mean.

9. (a) For a sample of n independently, identically distributed observations with
      density f(x; θ), find the limiting efficacy per observation of the sample median
      by direct calculation, using the fact that the median is asymptotically normal
      with mean equal to the population median μ(θ) and variance 1/{4nf²[μ(θ); θ]}.
   *(b) Find the efficacy of the number of negative observations and verify that it
      agrees with the limiting efficacy in (a).
      (Hint: Show that F[μ(θ); θ] = 0.5 and find μ′(θ) by differentiating.)

10. Use tests at level α = 0.50 to argue that the sample mean and the t statistic have
    asymptotically the same efficacy in samples from a distribution with finite variance.

11. Show that application of the method of Sect. 3.2 to the Wilcoxon signed-rank test
    for an assumed center of symmetry μ gives the median of the Walsh averages as an
    estimator of μ with the same asymptotic efficacy.

12. In the two-sample shift problem, apply the method of Sect. 3.2 to obtain estimators
that have the same asymptotic efficacy as
(a) The two-sample t test.
(b) The two-sample median test.
(c) The two-sample rank-sum test.
13. Derive Equation (3.6) by expanding the right-hand side of (3.4) in a Taylor's series
    about μ_{1,n}(θ), or the left-hand side of (3.5) about θ.
*14. (a) Consider a family of normal distributions with mean μ and standard deviation σ
       that are indexed by the 90th percentile value θ and σ. Given that σ = 1, how
       would tests and estimators for θ be based on the sample mean, the sample
       median, or the 90th percentile of the sample?
    (b) What would be the asymptotic efficacies and relative efficiencies of the tests
       and estimators in (a)?
    (c) Consider the following three possible definitions of θ for arbitrary distributions:
       (i) θ = μ + 1.645σ
       (ii) θ = ξ_{0.5} + 1.645σ
       (iii) θ = ξ_{0.9}
       where ξ_p is the quantile of order p in the distribution. Show that each of these
       definitions agrees with the definition given for normal distributions with σ = 1.
    (d) How might tests and estimators based on the sample mean, the sample median,
       or the 90th percentile of the sample be developed for each of the extended
       definitions of θ in (c)?
15. (a) Derive the approximate c.d.f. (4.2) of a confidence bound T_n from the approxi-
       mate power (4.1) of the corresponding test.
    (b) Show using (a) that the c.d.f.'s of two confidence bounds T_{1,n} and T_{2,n} are
       approximately the same except for a scale factor 1/√E_{1:2}(θ).

16. Let L(t − μ, θ) be the "loss" incurred when the confidence bound t is given for the
    parameter μ(θ). Show that, in the situations of Sect. 4,
    (a) The distribution of L(T_{1,n} − μ(θ), θ) is approximately the same as that of
       L{[T_{2,n} − μ(θ)]/√E_{1:2}(θ), θ}.
    (b) If L(z, θ) is homogeneous of degree k in z, that is, L(az, θ) = a^k L(z, θ), then
       E_θ{L[T_{1,n} − μ(θ), θ]} is approximately [E_{1:2}(θ)]^{−k/2} E_θ{L[T_{2,n} − μ(θ), θ]}.
    (c) Apply (b) to L(z, θ) = z and L(z, θ) = max(z, 0).
    *(d) Show that if ∂L(z, θ)/∂z exists and does not vanish at z = 0, then the same
       conclusion as for k = 1 in (b) holds.
    (e) If T_{1,n} and T_{2,n} are estimators rather than confidence bounds, what changes are
       necessary in (a)-(d)?
    (f) Apply (b) for estimators to the loss functions
       |z|, z², and c max(z, 0) + d max(−z, 0).
17. Show that the interpretation of asymptotic relative efficiency in terms of sample
sizes applies also to confidence bounds.
18. If T′_{i,n} and T″_{i,n} are respectively lower and upper confidence bounds for μ(θ), i = 1, 2,
    and if the asymptotic relative efficiencies of T′_{1,n} with respect to T′_{2,n}, and of T″_{1,n}
    with respect to T″_{2,n}, are both E_{1:2}(θ), show that
       E_θ(T″_{1,n} − T′_{1,n}) = E_θ(T″_{2,n} − T′_{2,n})/√E_{1:2}(θ)
    asymptotically.

19. For a sample from a population with median μ and a density positive and continuous
    at μ, show that the sample median is asymptotically normal with mean μ and vari-
    ance 1/(4nd²), where d is the density at μ.
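A simulation makes the claim concrete; the sketch below uses an arbitrary choice of n and the standard Laplace density, for which d = h(0) = 1/2 at the median, and compares n times the sampling variance of the median with 1/(4d²).

```python
# Simulation check of Problem 19: n * var(sample median) should be close
# to 1/(4 d^2), where d is the density at the population median.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 401, 20000
samples = rng.laplace(loc=0.0, scale=1.0, size=(reps, n))   # d = h(0) = 1/2
medians = np.median(samples, axis=1)
print("n * var(median):", n * medians.var())
print("1/(4 d^2)      :", 1.0 / (4 * 0.5 ** 2))
```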
20. For the Laplace density (5.1),
    (a) Show that the mean is θ and the variance is 2λ².
    (b) Show that the efficacies of the sample median and the sample mean are as
       given in (5.2) and (5.3) respectively.

21. For a sample of n from a density of the form f[(x − θ)/λ],
    (a) Show that the efficacy of the normal-theory procedures satisfies e_n(θ; λ) =
       n e₁(θ; 1)/λ² and in particular is independent of θ and inversely proportional
       to λ².
    (b) Show that the results in (a) hold also for the median procedures.
    (c) Show that the asymptotic relative efficiency of the median procedures relative
       to the normal-theory procedures is independent of both θ and λ and depends only
       on f.
    (d) Generalize the results in (a)-(c).

*22. Show that the double exponential power density (5.6) approaches the uniform
    density on (θ − λ, θ + λ) as k → ∞.

*23. (a) Find the efficacies of the sample median and mean and their asymptotic relative
       efficiency for the double exponential power density (5.6).
    (b) Show that the asymptotic relative efficiency in (a) approaches 1/3 as k → ∞.
    (c) Show that the result in (b) agrees with that of a direct calculation for the uniform
       distribution.
24. Show that (6.1)-(6.4) are equivalent, that is, that (6.3) and (6.4) are equivalent to
uniform convergence.
25. How might the various properties of asymptotic relative efficiency be restated for
the case where it is 0 or oo?
26. In typical testing situations as described in Section 2, show that
(a) The powers P1.,,(00 + 0,,) and P 2 ,,,(Oo + onfi) appearing in (6.4) approach
limits other than 1, Ct., or 0 if fio" ..... d of. 0, 00, and E > O.
(b) If E of. E 1 :2 (00), the limits in (a) are different except in the situation mentioned
in Sect. 6.2.
27. Restate the uniform convergence conditions (6.5)-(6.9) in the style of (6.3) and (6.4).
28. Rescale the variables in the uniform convergence conditions (6.5)-(6.9) in such a
way that the terms have nondegenerate limits and uniformity is no longer essential
to the meaning. What are the limits? State the corresponding properties of asymp-
totic relative efficiency in terms of these limits.
29. State and justify converses to the properties of asymptotic relative efficiency,
according to which each property determines the asymptotic relative efficiency
uniquely.
30. Let T_{1,n} and T_{2,n} be estimators of the same quantity μ = μ(θ). If √n(T_{1,n} − μ) has
    a nondegenerate limiting distribution and property B(i) holds, then √n(T_{2,n} − μ)
    has the same nondegenerate limiting distribution except for the scale factor
    1/√E_{1:2}(θ). Show this.

*31. (a) Give an example of an estimator T_{1,n} and a population distribution such that
       T_{1,n} has infinite variance for every n but its limiting distribution has finite
       variance.
    (b) Can the variance of the limiting distribution exceed the limit (or lim inf or
       lim sup) of the variance of √n T_{1,n}?
32. (a) Show that Property A(ii) implies that, for all δ, the difference between the two
       powers P_{1,n₁}(θ₀ + δ/√n₁) − P_{2,n₂}(θ₀ + δ/√n₂) approaches zero if n₂/n₁ → E.
       This property is sometimes taken as the definition of asymptotic relative
       efficiency.
    *(b) Show the converse of (a) under the additional condition that the power functions
       P_{1,n}(θ) and P_{2,n}(θ) are both monotone functions of θ for all n.
    (c) Give statements analogous to (a) and (b) for estimators.
    (d) Give statements analogous to (a) and (b) for confidence bounds.
33. (a) For one-tailed tests, choose dn such that the powers P l.n(OO + t) and P 2, n(OO +
dll t) have the same derivative with respect to tat t = O. Argue that dn->J E I z{(0)
as 11 -> 00.
(b) For unbiased two-tailed tests, choose dn such that the powers in (a) have the
same second derivative with respect to t at t = O. Argue that dn -> EI'z{Oo) as
11 -> 00.

(c) Interpret these results.


34. Show that the efficacy of a shift family of procedures under a shift model is constant.

35. For a sample of n observations on a shift family f(x; θ) = h(x − θ), derive the
    efficacy of
    (a) The sample mean.
    (b) The sample median when h has median 0.
    (c) The Wilcoxon signed-rank test statistic when h is symmetric about zero.
    (d) Interpret ∫h²(x)dx in terms of the density of a sum or difference of random
       variables with density h.
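As a numerical companion to Problem 35, the familiar answers (efficacy per observation 1/σ² for the mean, 4h²(0) for the median, and 12[∫h²(x)dx]² for the Wilcoxon statistic; compare Problems 36 and 45) can be evaluated for any concrete density. The sketch below uses the standard logistic density as an arbitrary example.

```python
# Efficacies per observation for a shift family: mean -> 1/sigma^2,
# median -> 4 h(0)^2, Wilcoxon -> 12 (int h^2 dx)^2, evaluated here for
# the standard logistic density h(x) = (1/4) sech^2(x/2).
import numpy as np
from scipy.integrate import quad

h = lambda x: 0.25 / np.cosh(x / 2.0) ** 2

sigma2 = quad(lambda x: x * x * h(x), -np.inf, np.inf)[0]
int_h2 = quad(lambda x: h(x) ** 2, -np.inf, np.inf)[0]

print("mean    :", 1.0 / sigma2)        # logistic: 3/pi^2
print("median  :", 4.0 * h(0.0) ** 2)   # logistic: 1/4
print("Wilcoxon:", 12.0 * int_h2 ** 2)  # logistic: 1/3
```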
36. Define Q as the quantity in braces in either line of Equation (8.8), so that the asymp-
    totic efficacy of procedures based on sums of signed constants is e₄ = 4Q²/∫₀¹c²(u)du.
    (a) Show that Q has the alternative expressions

       Q = c(0)h(0) + 2∫_{1/2}^{1} c′(2u − 1)h[H⁻¹(u)]du

         = c(0)h(0) + 2∫₀^∞ c′[2H(x) − 1]h²(x)dx

       if c[2H(x) − 1]h(x) → 0 as x → ∞.
    (b) Use the result in (a) to derive the asymptotic efficacy of the median and Wilcoxon
       signed-rank procedures as given in Equations (8.3) and (8.5).
    (c) Use the result in (a) to derive expressions for the asymptotic efficacy of the
       normal scores and squared rank procedures.
37. For the uniform shift family, f(x − θ) = 1/2R for |x − θ| < R, derive the efficacy
    and, under (8.7), the asymptotic efficacy of the procedures based on sums of signed
    constants. Verify the results given in (8.9) and (8.10).

38. (a) Derive the asymptotic efficacy of the normal scores procedures given in (8.11)
       for a symmetric shift.
    (b) Show that the result in (a) can also be written as e₅ = {∫₀¹ [Φ⁻¹′(t)/H⁻¹′(t)]dt}².
39. Determine the effect of a change of scale in the model on the efficacies of the pro-
    cedures in Sect. 8.2.

*40. (a) Show that the double exponential power density with the value of a specified
       in Table 8.1 approaches 1/2 for |x| < 1 and 0 for |x| > 1 as k → ∞.
    (b) What happens to the limit in (a) at x = 1? Is it relevant?
    (c) What happens to the limit in (a) if a is kept fixed as k → ∞?

*41. (a) Show that the symmetric beta density with the value of a specified in Table 8.1
       approaches a normal density for all x as r → ∞.
    (b) What happens to the limit in (a) if a is kept fixed as r → ∞?
42. Show that the logistic density appearing in Table 8.1 can be written in the alternative
    forms
       h(x) = a/[2(1 + cosh ax)] = a/[4 cosh²(ax/2)]
    where cosh z = (e^z + e^{−z})/2.

43. For the shift family f(x; θ) = h(x − θ), show that the Fisher information is
       I(θ) = E[h′(X)/h(X)]² = −E[φ″(X)]
    where X has density h and φ(x) = log h(x). (See also Problem 2.)
44. Let U_n be the sample mean and let T_n = U_n if |U_n| > n^{−1/4} and T_n = 0 otherwise.
    For a population with mean μ and finite variance σ²,
    (a) Show that T_n is asymptotically N(μ, σ²/n) if μ ≠ 0.
    (b) What is the asymptotic distribution of T_n if μ = 0?
    (c) What is the asymptotic efficacy of T_n? Is it continuous at μ = 0?

*45. For the double exponential power density h(x) of Table 8.1, show that
    (a) σ² = Γ(3/k)/[a²Γ(1/k)] and the asymptotic efficacy of the mean procedures is
       1/σ².
    (b) h(0) = ak/[2Γ(1/k)] and the asymptotic efficacy of the median procedures is
       [ak/Γ(1/k)]².
    (c) ∫h²(x)dx = ak/[2^{1+(1/k)}Γ(1/k)] and the asymptotic efficacy of the Wilcoxon
       procedures is 3[ak/2^{1/k}Γ(1/k)]².
    (d) I = a²k²Γ(2 − 1/k)/Γ(1/k).
    (e) The entries in Table 8.1 corresponding to (a)-(d) are correct.
    (f) The asymptotic efficiencies of the mean and Wilcoxon procedures relative to
       the median procedures both approach 3 as k → ∞.
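The closed forms in Problem 45(a)-(d) are convenient to evaluate numerically; the sketch below simply transcribes them (k and a are arbitrary inputs; k = 2 reduces to a normal density).

```python
# Efficacies for the double exponential power density, transcribed from
# Problem 45: mean 1/sigma^2, median [ak/Gamma(1/k)]^2,
# Wilcoxon 3[ak/(2^(1/k) Gamma(1/k))]^2, and Fisher information
# a^2 k^2 Gamma(2 - 1/k)/Gamma(1/k).
from scipy.special import gamma

def dep_efficacies(k, a=1.0):
    g = gamma(1.0 / k)
    sigma2 = gamma(3.0 / k) / (a * a * g)
    return {"mean": 1.0 / sigma2,
            "median": (a * k / g) ** 2,
            "wilcoxon": 3.0 * (a * k / (2.0 ** (1.0 / k) * g)) ** 2,
            "fisher": a * a * k * k * gamma(2.0 - 1.0 / k) / g}

print(dep_efficacies(k=2.0))   # k = 2: a normal density with variance 1/(2 a^2)
```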
46. For the normal scores procedures,
    (a) Show that the asymptotic efficacy is 2/π for the Laplace distribution and 1 for
       the standard normal distribution.
    (b) Verify the entries in Table 8.1 corresponding to (a). (Note that the variance of
       the normal density there is π/2.)

*47. For the one-sample squared rank procedures,
    (a) Show that the asymptotic efficacy is 5/9 for the Laplace distribution and
       80[arctan(1/√2)]²/π³ (with the arctan in radians) for the standard normal
       distribution.
    (b) Verify the entries in Table 8.1 corresponding to (a). (Note that the variance of
       the normal density there is π/2.)
48. For the uniform distribution on [ -1, 1], verify the asymptotic efficacies given in
Table 8.1.
*49. For the symmetric beta density h(x) of Table 8.1, show that
    (a) σ² = 1/[(2r + 1)a²] and the asymptotic efficacy of the mean procedures is
       (2r + 1)a².
    (b) h(0) = a/[2^{2r−1}B(r, r)] and the asymptotic efficacy of the median procedures is
       [a/4^{r−1}B(r, r)]².
    (c) ∫h²(x)dx = aB(2r − 1, 2r − 1)/[2B²(r, r)] and the asymptotic efficacy of the
       Wilcoxon procedures is 3a²B²(2r − 1, 2r − 1)/B⁴(r, r).
    (d) I = a²(r − 1)(2r − 1)/(r − 2) for r > 2. What happens if r ≤ 2?
    (e) The entries in Table 8.1 corresponding to (a)-(d) are correct.
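These expressions, like those of Problem 45, are easy to check numerically; the following sketch transcribes them (r and a are arbitrary; r = 2, a = 1 gives the quadratic density 3(1 − x²)/4 of Problem 64(c)).

```python
# Efficacies for the symmetric beta density, transcribed from Problem 49.
# The Fisher information I diverges for r <= 2.
from scipy.special import beta as B

def sym_beta_efficacies(r, a=1.0):
    return {"mean": (2 * r + 1) * a * a,
            "median": (a / (4.0 ** (r - 1) * B(r, r))) ** 2,
            "wilcoxon": 3 * a * a * B(2 * r - 1, 2 * r - 1) ** 2 / B(r, r) ** 4,
            "fisher": a * a * (r - 1) * (2 * r - 1) / (r - 2) if r > 2 else float("inf")}

print(sym_beta_efficacies(r=2.0))
```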

50. For the one-sample squared rank procedures for shifts, show that
    (a) The asymptotic efficacy is

       20{∫₀^∞ [2H(x) − 1]²h′(x)dx}² = 320{∫₀^∞ [2H(x) − 1]h²(x)dx}².

    *(b) The result in (a) for the symmetric beta density of Table 8.1 with a = 1 is
       5(33/32)² if r = 2; 320(2/π)⁶(π/3 + 1/5 − 7/9)² if r = 1.5; 5[283·15/(7·2⁹)]² if
       r = 3; 5[187·175/(3·2¹³)]² if r = 4.
    *(c) Verify the entries in Table 8.1 corresponding to (b).
*51. For the cumulative logistic distribution H(x) = 1/(1 + e^{−x}), show that (Hint:
    The relation h(x) = H(x)[1 − H(x)] and the substitution y = H(x) may sometimes
    be helpful.)
    (a) σ² = π²/3 and the asymptotic efficacy of the mean procedures is 3/π².
    (b) h(0) = 1/4 and the asymptotic efficacy of the median procedures is 1/4.
    (c) ∫h²(x)dx = 1/6 and the asymptotic efficacy of the Wilcoxon procedures is 1/3.
    (d) The asymptotic efficacy of procedures based on sums of signed constants
       satisfying (8.7) is e₄ = [∫₀¹ uc(u)du]²/∫₀¹ c²(u)du.
    (e) For c(u) = u², e₄ = 5/16.
    (f) The efficacy of the normal scores procedures is 1/π.
    (g) I = 1/3.
    (h) The entries in Table 8.1 corresponding to (a)-(g) are correct.
52. For the Cauchy density h(x) = 1/[π(1 + x²)], show that
    (a) σ² = ∞ and the asymptotic efficacy of the mean procedures is 0.
    (b) h(0) = 1/π and the asymptotic efficacy of the median procedures is 4/π².
    (c) ∫h²(x)dx = 1/2π and the asymptotic efficacy of the Wilcoxon procedures is
       3/π².
    *(d) The asymptotic efficacy of procedures based on sums of signed constants
       satisfying (8.7) is

       e₄ = [∫₀¹ c(u)sin(πu)du]² / ∫₀¹ c²(u)du.

    *(e) For c(u) = u², e₄ = (5/π²)(1 − 4/π²)².
    *(f) I = 1/2.
    (g) The entries in Table 8.1 corresponding to (a)-(f) are correct.
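The formula in Problem 52(d) lends itself to a quick numerical confirmation of (e); the sketch below evaluates both sides for c(u) = u² (an arbitrary choice of score function).

```python
# Numerical check of Problem 52(d)-(e): for the Cauchy shift family,
# e4 = [int_0^1 c(u) sin(pi u) du]^2 / int_0^1 c(u)^2 du, and for c(u) = u^2
# this equals (5/pi^2)(1 - 4/pi^2)^2.
import numpy as np
from scipy.integrate import quad

c = lambda u: u ** 2
num = quad(lambda u: c(u) * np.sin(np.pi * u), 0.0, 1.0)[0] ** 2
den = quad(lambda u: c(u) ** 2, 0.0, 1.0)[0]
print(num / den, (5.0 / np.pi ** 2) * (1.0 - 4.0 / np.pi ** 2) ** 2)  # both ~ 0.179
```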

53. Derive Equation (8.17) for the conditional probability, given the absolute values of a
    sample, that the jth smallest belongs to a positive observation.

54. Show that the asymptotic efficiency of the normal-theory procedures relative to
    the median procedures for shifts h(x − θ) can be infinite even if h is symmetric,
    but cannot exceed 3 if h has its maximum at its median.

55. Justify the bounds given in Sect. 8.3 for the asymptotic efficiency of the Wilcoxon
    procedures relative to the median procedures.

56. Show that the asymptotic efficiency of the normal scores procedures relative to the
    median procedures can be arbitrarily small for shifts of a symmetric density with a
    spike of height 1/ε and width ε² at the median.

*57. (a) Show that the asymptotic efficiency of the Wilcoxon procedures relative to the
       normal scores procedures for shifts of a symmetric density h cannot exceed 6/π.
       (Hint: Use the second expression in (8.11).)
    (b) Show that the value in (a) is approached for a suitable symmetric unimodal
       density with a spike of height 1/ε² and width ε³ at the median.
*58. (a) If H is a c.d.f. with finite variance, show that xφ{Φ⁻¹[H(x)]} → 0 as x → ±∞,
       where Φ and φ are the standard normal cumulative and density functions.
       (Hint: Tchebycheff's inequality can be used [Gastwirth and Wolff, 1968].)
    (b) If H is continuous, show that ∫{Φ⁻¹[H(x)]}²h(x)dx = 1.

59. (a) Show that E(1/Z) ≥ 1/E(Z) if the random variable Z is positive with prob-
       ability one. (One proof uses Jensen's inequality; another uses (8.27) or a similar
       inequality. For others, see Gurland [1967].)
    *(b) Use the result in (a) instead of (8.27) to show that the asymptotic efficiency of
       the normal scores procedures relative to the normal-theory procedures for
       symmetric shifts is at least one [Gastwirth and Wolff, 1968].
*60. Derive the locally most powerful signed-rank test given by (8.33) for a symmetric
shift family.
*61. (a) Show that the test based on the sum of signed constants (8.33) satisfies (8.7) with

(b) Verify that substitution of the result in (a) in (8.8) gives e4. = I.
*62. Derive the expression (8.37) relating the efficacy of the signed-rank test given by
(8.33) to the Fisher information.
63. Let the joint density of (V, W) belong to a shift family g(v, w; θ) = g(v, w − θ; 0)
    and let X = W − V.
    (a) Express the density of X in terms of g and show that it can be written as a shift
       family f(x; θ) = h(x − θ).
    (b) Show that every univariate shift family of densities can arise as in (a).
    (c) If (V, W) is permutationally symmetrically distributed when θ = 0, show that
       X is symmetrically distributed about θ.
    (d) Show that every symmetric density h can arise in the manner of (c).
    (e) What role does the restriction to densities play in (a)-(d)?

64. Let φ be the characteristic function of X. Show that
    (a) If V and W are independently, identically distributed and X = W − V, then φ
       is real and nonnegative everywhere.
    (b) If X has the rectangular distribution on [−1, 1], then φ(t) = (sin t)/t, which can
       take on negative values.
    *(c) If X has the quadratic density h(x) = 3(1 − x²)/4 (symmetric beta with
       r = 2, a = 1), then φ(t) = 3(sin t − t cos t)/t³, which can take on negative
       values. (See Corollary 4.6 of Hollander [1967].)
    *(d) If X has the symmetric beta density with r = 3, a = 1, then
       φ(t) = 15[(3 − t²)sin t − 3t cos t]/t⁵,
       which can take on negative values.
    (e) φ(t; r) = ∫₀¹ (1 − y)^{r−2} sin(t√y)dy/[tB(1.5, r − 1)] for r > 1, where φ(t; r) is the
       characteristic function of the symmetric beta density with a = 1 for arbitrary r.
    *(f) If φ(t; r) < 0 for some t ∈ (0, 2π] and some r > 1, then φ(t; s) < 0 for the same t
       and all s ∈ (1, r]. (Hint: Use the result in (e) and the fact that (1 − y)^{r−2}/(1 − y)^{s−2}
       is decreasing in y while sin(t√y) changes sign at most once for y ∈ [0, 1].)
65. Let V and W be independently, identically distributed with central moments μ_j,
    and let X = W − V have central moments θ_j, j = 1, 2, .... Show that
    (a) θ_j = 0 for j odd, θ₂ = 2μ₂, and θ₄ = 2μ₄ + 6μ₂².
    (b) θ₄ ≥ 2θ₂².
    (c) X cannot have a rectangular distribution because this would contradict (b).
    (d) θ₆ = 2μ₆ + 30μ₂μ₄ − 20μ₃² and θ₈ = 2μ₈ + 56μ₂μ₆ − 112μ₃μ₅ + 70μ₄².
    *(e) X cannot have a symmetric beta distribution with r = 2 because this would
       contradict known inequalities among the moments.
66. Under the matched pairs model with independent errors leading to (9.5), show that
(a) The efficacy of the mean of the differences X} is n/2q2, where q2 is the variance
of the errors.
(b) The efficacy of the median of the differences X) is 4nU q2(x)dx] 2, where q is
the density of the errors.
67. Under the matched pairs model with independent errors leading to (9.5), show that
    (a) The bounds of Table 8.2 on asymptotic relative efficiencies of one-sample
       procedures which can be achieved by normal and Cauchy distributions cannot
       be improved.
    (b) The maximum asymptotic efficiency of the normal-theory procedures relative
       to the median procedures is 125/72, achieved when q is the symmetric beta
       density with r = 2. (Hint: Use Problem 66.)
    *(c) The Wilcoxon and normal scores procedures can have arbitrarily small
       asymptotic efficiency relative to the median procedures. (Mixtures of normal
       distributions can be used for q.)
    *(d) The asymptotic efficiency of the normal scores procedures relative to the
       Wilcoxon procedures must be more than but can be arbitrarily close to π/6.
       (Mixtures of normal distributions can again be used for q.)
    (e) In each case (a)-(d), q is or can be chosen symmetric and unimodal.
68. (a) Give a two-sample form of the development leading to (2.11).
(b) Which steps in the development in (a) are exact for two-sample normal-theory
tests of equality of population means when the variances are known but possibly
unequal?

69. Let X" ... , Xm have c.dJ. F and Y" . .. , y" have c.dJ. G; let Z,' Z 2' ... be indepen-
dently, identically distributed according to H which is symmetric about 0; and
define
the number of lj < y if y < 0,
nG*(y) = {
" the number of lj ~ y if y ;;:: o.
Show that
(a) The two-sample Wilcoxon test statistic is equivalent to L Fm( lj), which
approaches L, F(lj) as m --> 00.
(b) The one-sample Wilcoxon test statistic for the Y sample is equivalent to
LJ [G:( lj) - G:( - lj)].
(c) For shifts of H, each of the statistics in (a) and (b) has the same asymptotic
behavior as a linear function of L H(Zj + 0).
(d) A two-sample sum of scores test statistic satisfying (10.6) is asymptotically
equivalent to L c[F(lj)] as first m --> 00 and then n --> 00.
(e) A one-sample sum of scores test statistic satisfying (8.7) is asymptotically
equivalent to LJ c[G:(lj) - G:( -lj)] as n --> 00.
(f) In the situation of (c), the sum in (d) is distributed as L c[H(Zj + 0)] while
the sum in (e) IS distributed as L c[2H(Zj + 0) - 1].
(g) With c = C(I) or c(2) as appropriate, the sums in (d)-(f) are the same under
condition (10.8).
*70. For each of the two-sample normal-theory, median, and Wilcoxon procedures
for shift alternatives (10.4),
(a) Derive directly the asymptotic efficacy.
(b) Verify that the efficacy does not depend on the ratio of the sample sizes and
agrees with the corresponding one-sample efficacy.
71. For a two-sample procedure based on a sum of scores,
(a) Derive the efficacy (10.5).
(b) Derive the asymptotic efficacy (10.7).
72. (a) Show that the one- and two-sample median and Wilcoxon procedures have
corresponding scores functions in the sense of (10.8).
(b) Derive the two-sample procedure that corresponds similarly to the one-
sample squared ranks procedure.
73. Problems 73-78 all concern the asymptotic robustness of two-sample location
    tests against inequality of scale. See also Pratt [1964]. Consider a test at level α
    for the null hypothesis that the densities f and g, with c.d.f.'s F and G respectively,
    are equal. Suppose, for definiteness, that the test is one-tailed, and let K = Φ⁻¹(α) be
    the corresponding standard normal deviate. Assume that m/N → λ (which now matters).
    For the two-sample t test, show that
    (a) The probability of rejection approaches 0 or 1 if f and g have different means
       (and finite variances).
    (b) The probability of rejection approaches Φ(Kd) if f and g have equal means,
       where d = [λ + (1 − λ)θ²]^{1/2}/(1 − λ + λθ²)^{1/2} and θ² = var(Y)/var(X).
    (c) For fixed λ, the range of possible values of d is an interval with endpoints
       (λ⁻¹ − 1)^{1/2} and (λ⁻¹ − 1)^{−1/2}.

*74. For the two-sample median test, show that results like those in Problem 73 hold
    with medians replacing means, d = (1 − λ + λθ)/(1 − λ + λθ²)^{1/2}, and θ = f/g at
    the common median, and the endpoints of the range of d are [min(λ, 1 − λ)]^{1/2}
    and 1. (Hint: Use the corresponding confidence bound given in (3.6) of Chap. 5.)
*75. For the two-sample Wilcoxon test, show that results like those of Problem 73 hold
    but p₁ = 1/2 replaces equal means, d = [12λp₂ + 12(1 − λ)p₃ − 3]^{−1/2}, and the
    range of d has endpoints (3 − 3λ)^{−1/2} and (1 − λ)^{1/2}/(3λ − 4λ²)^{1/2} if λ ≤ 1/2 and the
    same with λ replaced by (1 − λ) otherwise, where p₁, p₂, and p₃ are defined by
    (4.12)-(4.14) of Chap. 5. For the range of d, use var(U) ≤ mn[3m − (m − 1)²/(n − 1)]/12
    if p₁ = 1/2, where U is the Mann-Whitney statistic defined in (4.5) of Chap. 5 [Birnbaum
    and Klose, 1957]. (Hint: Equation (4.26) of Chap. 5 gives an upper bound for
    var(U) which is achieved for constant Y.)
*76. In Problem 75, if f and g have the same center of symmetry and variances in the
    ratio θ² = var(Y)/var(X), show that
    (a) p₂ = 1/4 + (1/2π) arcsin[θ²/(θ² + 1)] for f and g normal, where the arcsin is in
       radians.
    (b) p₂ = 1/4 + θ²/[2(θ + 1)(2θ + 1)] for f and g the Laplace density (5.6).
    (c) p₂ = 1/4 + θ²/12 if θ ≤ 1 and p₂ = 1/2 − 1/6θ if θ ≥ 1 for f and g uniform.
    (d) p₃ is given by the same formulas but with θ replaced by 1/θ.
*77. For the two-sample normal scores test, show that results like those of Problem
    73(a) and (b) hold but γ = 0 replaces equal means and

       d = [2λJ(F, G, λ) + 2(1 − λ)J(G, F, 1 − λ)]^{−1/2}

    where γ = ∫Φ⁻¹[H(x)]f(x)dx, H(x) = λF(x) + (1 − λ)G(x), and

       J(F, G, λ) = ∬_{x<y} G(x)[1 − G(y)]Φ⁻¹′[H(x)]Φ⁻¹′[H(y)]f(x)f(y)dxdy.

*78. What are the implications of Problems 73-77 for confidence procedures?
79. (a) Derive the approximation (11.1) for the power of the one-sided, one-sample
       Kolmogorov-Smirnov test against shifts of the uniform distribution.
    (b) Derive the approximate power (11.3) of the median test against the alternative
       in (a).
    (c) Derive the asymptotic relative efficiency (11.4) of the one-sided Kolmogorov-
       Smirnov test relative to the median test for uniform shift alternatives.
    *(d) Find the limits of the asymptotic relative efficiency as α → 0 for fixed β, as
       β → 0 for fixed α, and as β → 1 − α.
    *(e) Show that (a) and (c) provide upper bounds for any symmetric, unimodal
       shift alternative.
    *(f) For what other alternatives do (a) and (c) provide upper bounds?
80. For the one-sided, one-sample Kolmogorov-Smirnov test, derive
    (a) The lower bound (11.6) for its power.
    *(b) The lower bound (11.7) for its asymptotic efficiency relative to the median test.
       (Hint: Use (11.2) and (11.3).)
81. (a) Let f be a density on (−∞, B) where B is finite and let ξ_θ be its quantile of order
       θ. Let g be a density on (ξ_θ, ∞) such that g(y) = f(y) for ξ_θ ≤ y ≤ B. Show that
       the power of a Kolmogorov-Smirnov test of the null hypothesis f against the
       alternative g depends only on θ, and not on f, and in particular, it is the same for
       any f as for a uniform distribution on (0, 1).
    *(b) What happens in (a) if no such finite limit B exists?

*82. For the one-sided, one-sample Kolmogorov-Smirnov test,
    (a) Derive the lower bound (11.9) for its power.
    (b) Derive the lower bound (11.10) for its asymptotic efficiency relative to the
       median test.
    (c) Show that the bound in (b) approaches 0 as β → 1 − α.

*83. (a) Let f(x) = 1/2 − ε + 1/ε for |x| < ε², f(x) = 1/2 − ε for ε² ≤ |x| ≤ 1, and f(x) = 0
       otherwise. Show that, for sufficiently small ε, shifts of f(x) come arbitrarily
       close to achieving the lower bounds (11.8)-(11.10).
    (b) Find the limits of the lower bound (11.10) as α → 0 for fixed β, as β → 0 for fixed α,
       and as β → 1 − α.

*84. Find the limits of the improved lower bound (11.19) for the asymptotic efficiency of
    the Kolmogorov-Smirnov test relative to the median test as α → 0 for fixed β,
    as β → 0 for fixed α, and as β → 1 − α.

85. Let h be a density with c.d.f. H and median 0. Show that
    (a) The following conditions are equivalent:
       (i) h(x)/H(x) ≥ 2h(0) for x ≤ 0;
       (ii) h[H⁻¹(u)]/u ≥ 2h(0) for u ≤ 1/2;
       (iii) h(x)/H(x) is minimized for x ≤ 0 by x = 0.
    (b) The following conditions are equivalent:
       (iv) h(x)/H(x) is decreasing in x for x ≤ 0;
       (v) h[H⁻¹(u)]/u is decreasing in u for u ≤ 1/2;
       (vi) log H(x) is strictly concave in x for x ≤ 0.
    (c) The following conditions are equivalent:
       (vii) h′(x)/h(x) is decreasing in x for x ≤ 0;
       (viii) log h(x) is strictly concave in x for x ≤ 0;
       (ix) h[H⁻¹(u)] is strictly concave in u for u ≤ 1/2.
    (d) (iv)-(vi) imply (i)-(iii).
    *(e) (vii)-(ix) imply (iv)-(vi). (One proof uses Cauchy's theorem of elementary
       calculus. Another uses the theorem that log ∫f(x, y)dy is a concave function of
       x if log f(x, y) is a concave function of (x, y) [Brascamp and Lieb, 1976].)
    (f) The relations in (b)-(e) hold if "decreasing" is replaced by "nonincreasing"
       and "strictly concave" by "concave" throughout.

86. Show that [Pratt, 1981]
    (a) The double exponential power distribution satisfies (i)-(ix) of Problem 85 if
       k ≥ 1 and none of these conditions otherwise.
    (b) The symmetric beta distribution satisfies (i)-(vi) for all r and (vii)-(ix) if and
       only if r ≥ 1.
    (c) The logistic distribution satisfies (i)-(ix).
    (d) The Cauchy distribution satisfies (i)-(vi) but not (vii)-(ix).
    *(e) The t distribution satisfies (i)-(vi) but not (vii)-(ix).
    (f) The density h(x) = (1/2)cos x for |x| ≤ π/2 and 0 elsewhere satisfies (i)-(ix).

*87. Show that the Laplace distribution achieves equality in the lower bounds (11.18)
    and (11.19).
88. Let n₁ and n₂ be the sample sizes required to achieve a given β by a one-sided
    normal-theory test at level α/2 and by a two-sided normal-theory test at level α
    for the one-sample shift problem.
    (a) Give expressions that determine n₁ and n₂ for alternatives near the null hypo-
       thesis.
    (b) Show that n₁/n₂ ≤ 1.02 asymptotically for α ≤ 0.1 and β ≤ 0.74; also for
       α ≤ 0.05 and β ≤ 0.85; also for other combinations of α and β.

*89. For the Kolmogorov-Smirnov test with one-sided critical value c at level α/2,
    show that
    (a) P_G[inf(F_n − F) ≤ −c and sup(F_n − F) ≤ c] ≤ α/2 if G(x) ≥ F(x) for all x.
    (b) Asymptotically the probability in (a) comes arbitrarily close to

       α/2 − P_G[inf(F_n − G) ≤ −c and sup(F_n − G) ≥ c]

       and the inequality (11.27) comes arbitrarily close to equality for shifts of a
       distribution which is uniform on ⋃_{i=1}^{m} (i, i + 1/2) for sufficiently large m.
*90. Show that, in the notation of Problem 89, for G continuous, as n → ∞,

       → Σ_{j=1}^{∞} {e^{−2[(2j−1)c−(j−1)θ]²} + e^{−2[(2j−1)c−jθ]²} − 2e^{−2j²(2c−θ)²}}.

Tables
Tables


Table A Cumulative Standard Normal Distribution


Each table entry is the cumulative probability of a standardized normal variable z = (x − μ)/σ, right tail from the
value of z to plus infinity, and also left tail from minus infinity to the value of −z, for all P ≤ 0.50. Read down
the first column to the correct first decimal value of z, and over to the correct column for the second decimal
value. The number at the intersection is the value of P.

0.00 001 002 003 004 0.05 0.06 007 0.08 0.09

00 0 50000 49601 49202 48803 48405 48006 47608 47210 46812 46414
0.1 46017 45620 45224 44828 44433 44038 43644 43251 42858 42465
0.2 42074 41683 41294 40905 40517 40129 39743 39358 38974 38591
03 38209 37828 37448 37070 36693 36317 35942 35569 35197 34827
0.4 34458 34090 33724 33360 32997 32636 32276 31918 31561 31207

05 30854 30503 30153 29806 29460 29116 28774 28434 28096 27760
0.6 27425 27093 26763 26435 26109 25785 25463 25143 24825 24510
07 24196 23885 23576 23270 22965 22663 22363 22065 21770 21476
08 21186 20897 20611 20327 20045 19766 19489 19215 18943 18673
09 18406 18141 17879 17619 17361 17106 16853 16602 16354 16109

1.0 15866 15625 15386 15151 14917 14686 14457 14231 14007 13786
II 13567 13350 13136 12924 12714 12507 12302 12100 11900 11702
12 11507 11314 11123 10935 10749 10565 10383 10204 10027 09853
1.3 00 96800 95098 93418 91759 90123 88508 86915 85343 83793 82264
14 80757 79270 77804 76359 74934 73529 72145 70781 69437 68112

15 66807 65522 64255 63008 61780 60571 59380 58208 57053 55917
16 54799 53699 52616 51551 50503 49471 48457 47460 46479 45514
17 44565 43633 42716 41815 40930 40059 39204 38364 37538 36727
18 35930 35148 34380 33625 32884 32157 31443 30742 30054 29379
19 28717 28067 27429 26803 26190 25588 24998 24419 23852 23295

20 22750 22216 21692 21178 20675 20182 19699 19226 18763 18309
2I 17864 17429 17003 16586 16177 15778 15386 15003 14629 14262
22 13903 13553 13209 12874 12545 12224 11911 11604 11304 11011
23 10724 10444 10170 09903 09642 09387 09137 08894 08656 08424
24 00' 81975 79763 77603 75494 73436 71428 69469 67557 65691 63872

2.5 62097 60366 58677 57031 55426 53861 52336 50849 49400 47988
26 46612 45271 43965 42692 41453 40246 39070 37926 36811 35726
27 34670 33642 32641 31667 30720 29798 28901 28028 27179 26354
28 25551 24771 24012 23274 22557 21860 21182 20524 19884 19262
2.9 18658 18071 17502 16948 16411 15889 15382 14890 14412 13949

3.0 13499 13062 12639 12228 11829 11442 11067 10703 10350 10008
3.1 00 3 96760 93544 90426 87403 84474 81635 78885 76219 73638 71136
3.2 68714 66367 64095 61895 59765 57703 55706 53774 51904 50094
3.3 48342 46648 45009 43423 41889 40406 38971 37584 36243 34946
3.4 33693 32481 31311 30179 29086 28029 27009 26023 25071 24151
35 23263 22405 21577 20778 20006 19262 18543 17849 17180 16534
36 15911 15310 14730 14171 13632 13112 12611 12128 11662 11213
37 10780 10363 09961 09574 09201 08842 08496 08162 07841 07532
3.8 0.0' 72348 69483 66726 64072 61517 59059 56694 54418 52228 50122
39 48096 46148 44274 42473 40741 39076 37475 35936 34458 33037

40 31671 30359 29099 27888 26726 25609 24536 23507 22518 21569
41 20658 19783 18944 18138 17365 16624 15912 15230 14575 13948
42 13346 12769 12215 11685 11176 10689 10221 09774 09345 08934
43 00' 85399 81627 78015 74555 71241 68069 65031 62123 59340 56675
44 54125 51685 49350 47117 44979 42935 40980 39110 37322 35612

Table A (continued)
z 000 001 002 0.03 004 005 006 007 008 009

4.5 33977 32414 30920 29492 28127 26823 25577 24386 23249 22162
4.6 21125 20133 19187 18283 17420 16597 15810 15060 14344 13660
4.7 13008 12386 11792 11226 10686 10171 09680 09211 08765 08339
48 00 6 79333 75465 71779 68267 64920 61731 58693 55799 53043 50418
49 47918 45538 43272 41115 39061 37107 35247 33476 31792 30190

For larger values of z, P can be approximated by (2π)^{−1/2}Me^{−z²/2} = 0.398942 Me^{−z²/2} where

The error is less than 0.0005P for z ≥ 1.2. See Peizer and Pratt [1968].

Source: Taken from Table III of Fisher and Yates, Statistical Tables for Biological, Agricultural and Medical Research, published
by Longman Group Ltd., London (6th edition, 1974, page 45) (previously published by Oliver & Boyd, Ltd., Edinburgh), by
permission of the authors and publishers.

Table B Cumulative Binomial Distribution


Each table entry is the left-tail binomial cumulative probability of s or less successes in n trials with probability p of
success, for 1 ≤ s < n/2, 4 ≤ n ≤ 20, and p = 0.5, 21 ≤ n ≤ 30. Each entry is also the right-tail cumulative probability
of n − s or more successes for probability 1 − p, as given in the right column and last row for n ≤ 20. For s = 0,
use P(S = 0) = (1 − p)^n. For other values of s, consider the complementary tail. For other values of p and/or n,
Table A can be used with the approximate standard normal deviate

Z=~~- d -~- (
{1211 s'
s'ln-+ t'ln-t')
Is' - IIpl 611 + I lip IIq

where s' = s + 1/2, t' = n − s − 1/2, d = s + 2/3 − (n + 1/6)p + 0.02[q/(s + 1) − p/(n − s) + (q − 0.5)/(n + 1)].
The error is less than 1% of the smaller tail probability if s, n − s − 1 ≥ 2 and 0.19 ≤ s'q/t'p ≤ 5.3. It is less than 0.2%
if s, n − s − 1 ≥ 4 and 0.40 ≤ s'q/t'p ≤ 2.5. See Peizer and Pratt [1968]. It is less than 0.00082 if s, n − s − 1 ≥ 1,
and less than 0.00012 if s, n − s − 1 ≥ 4. See Ling [1978].
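Working by computer, exact entries of Table B can also be generated directly rather than read from the table or approximated; a minimal sketch (the particular n, s, and p are arbitrary):

```python
# Left-tail binomial probability P(S <= s) for n trials with success
# probability p; e.g., n = 10, s = 3, p = 0.3 reproduces the Table B entry 0.6496.
from scipy.stats import binom

print(round(binom.cdf(3, 10, 0.3), 4))
```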

p 005 010 020 030 040 050 060 070 080 090 095

II n-s
4 I 09860 09477 0.8192 06517 04752 03125 01792 00837 00272 00037 00005 3
5 I 0.9774 09185 07373 05282 03370 01875 00870 00308 00067 00005 00000 4
2 09988 09914 09421 08369 06826 05000 0.3174 0.1631 00579 00086 00012 3

6 I 09672 08857 06554 04202 0.2333 01094 00410 0.0109 0.0016 0.0001 00000 5
2 09978 09842 09011 07443 05443 03438 01792 0,0705 0.0170 00013 00001 4

7 I 09556 08503 0.5767 0.3294 0.1586 00625 00188 00038 00004 0.0000 0.0000 6
2 09962 0.9743 08520 06471 0.4199 0.2266 00963 0.0288 00047 00002 00000 5
3 09998 0.9973 09667 08740 07102 05000 0.2898 01260 0.0333 0.0027 0.0002 4

8 I 09428 08131 0.5033 02553 0.1064 0.0352 00085 00013 00001 0.0000 0.0000 7
2 09942 09619 0.7969 0.5518 0.3154 01445 00498 0.0113 0.0012 0.0000 00000 6
3 09996 09950 09437 08059 0.5941 03633 01737 0.0580 00104 00004 00000 5

9 I 09288 07748 0.4362 01960 00705 0.0195 00038 00004 00000 0.0000 0.0000 8
2 09916 0.9470 0.7382 04628 02318 00898 00250 0.0043 0.0003 00000 00000 7
3 0.9994 09917 0.9144 07297 0.4826 02539 0.0994 00253 00031 00001 00000 6
4 10000 09991 09804 09012 07334 05000 0.2666 00988 0.0196 00009 00000 5

10 09139 07361 0.3758 0.1493 0.0464 0.0107 00017 00001 00000 0.0000 0.0000 9
2 09885 09298 0.6778 0.3828 01673 0.0547 0.0123 00016 0.0001 0.0000 0.0000 8
3 09990 09872 08791 0.6496 0.3823 0.1719 0.0548 00106 00009 0.0000 0.0000 7
4 09999 09984 0.9672 0.8497 0.6331 0.3770 0.1662 00473 00064 00001 00000 6

II I 08981 06974 03221 01130 0.0302 0.0059 00007 00000 00000 00000 00000 10
2 09848 09104 0.6174 0.3127 01189 00327 0.0059 00006 00000 0.0000 00000 9
3 09984 09815 08389 05696 02963 01133 00293 00043 00002 00000 0.0000 8
4 09999 09972 0.9496 0.7897 05328 02744 00994 00216 00020 00000 00000 7
5 10000 09997 09883 09218 07535 05000 02465 00782 00117 00003 0.0000 6

12 I 08816 0.6590 02749 0.0850 0.0196 00032 00003 00000 00000 00000 00000 II
2 09804 0.8891 05583 0.2528 00834 0.0193 00028 00002 00000 00000 00000 10
3 09978 09744 0.7946 04925 02253 00730 0.0153 00017 0.0001 00000 00000 9
4 0.9998 0.9957 0.9274 07237 04382 0.1938 00573 0.0095 0.0006 00000 00000 8
5 10000 09995 09806 0.8822 06652 0.3872 01582 0.0386 00039 00001 00000 7

13 I 08646 06213 02336 00637 00126 00017 00001 00000 0.0000 0.0000 0.0000 12
2 09755 08661 0.5017 02025 00579 00112 0.0013 00001 0.0000 00000 00000 II
3 09969 09658 07473 04206 01686 00461 00078 0.0007 00000 00000 00000 10
4 09997 09935 09009 06543 03530 01334 0.0321 00040 00002 0.0000 00000 9
5 10000 0.9991 09700 08346 05744 02905 00977 00182 00012 00000 00000 8
6 10000 09999 09930 09376 07712 05000 02288 0.0624 0.0070 00001 00000 7

I-p 0.95 0.90 0.80 0.70 0.60 050 0.40 030 020 010 0.05

Table B (continued)
p 005 0.10 020 030 040 050 060 070 080 090 095
/I " -s
14 08470 05848 01979 00475 00081 00009 00001 00000 00000 00000 00000 13
2 09699 08416 04481 01608 00398 00065 00006 00000 00000 00000 00000 12
3 0.9958 09559 06982 03552 01243 00287 00039 0.0002 0.0000 0.0000 00000 II
4 09996 0.9908 08702 05842' 02793 0.0898 00175 00017 00000 00000 00000 10
5 10000 09985 09561 0.7805 04859 02120 00583 00083 00004 00000 00000 9
6 10000 09998 09884 09067 06925 03953 01501 0.0315 00024 00000 00000 8

15 I 08290 05490 01671 00353 00052 00005 0.0000 00000 00000 00000 00000 14
2 09638 08159 03980 01268 00271 00037 00003 00000 00000 00000 00000 13
3 09945 09444 06482 02969 00905 00176 00019 00001 00000 00000 00000 12
4 09994 09873 08358 05155 02173 0.0592 0.0093 0.0007 00000 00000 00000 II
5 09999 09978 09389 07216 04032 01509 00338 00037 00001 00000 00000 10
6 10000 09997 09819 08689 06098 03036 0.0950 0.0152 0.0008 0.0000 0.0000 9
7 10000 10000 09958 09500 07869 0.5000 02131 00500 00042 00000 0.0000 8

16 I 08108 05147 01407 00261 00033 00003 00000 0.0000 0.0000 00000 00000 15
2 09571 07892 03518 00994 0.0183 0.0021 0.0001 0.0000 00000 0.0000 0.0000 14
3 09930 09316 0.5981 02459 0.0651 00106 00009 0.0000 00000 0.0000 00000 13
4 09991 09830 07982 04499 01666 00384 00049 00003 00000 0.0000 0.0000 12
5 09999 09967 09183 06598 0.3288 01051 00191 0.0016 0.0000 0.0000 00000 II
6 10000 09995 09733 08247 0.5272 02272 00583 00071 00002 0.0000 00000 10
7 10000 09999 09930 09256 07161 04018 01423 0.0257 00015 00000 00000 9

17 I 07922 0.4818 01182 00193 00021 0.0001 0.0000 0.0000 00000 00000 00000 16
2 09497 0.7618 03096 00774 00123 0.0012 0.0001 0.0000 00000 00000 00000 15
3 09912 09174 05489 0.2019 0.0464 00064 0.0005 00000 0.0000 0.0000 0.0000 14
4 09988 09779 0.7582 03887 0.1260 0.0245 0.0025 00001 0.0000 0.0000 00000 13
5 09999 09953 08943 05968 02639 00717 0.0106 0.0007 0.0000 0.0000 00000 12
6 10000 09992 09623 07752 04478 01662 00348 0.0032 00001 0.0000 0.0000 11
7 10000 09999 09891 08954 0.6405 03145 00919 00127 00005 00000 00000 10
8 10000 10000 09974 09597 08011 05000 0.1989 0.0403 00026 0.0000 00000 9
18 I 07735 04503 00991 0.0142 00013 0.0001 0.0000 00000 00000 00000 00000 17
2 09419 07338 02713 00600 00082 00007 00000 00000 00000 00000 0.0000 16
3 09891 09018 05010 01646 00328 00038 00002 0.0000 0.0000 0.0000 00000 15
4 09985 09718 07164 03327 00942 00154 00013 00000 00000 00000 0.0000 14
5 09998 09936 08671 05344 02088 0.0481 00058 00003 0.0000 0.0000 0.0000 13
6 10000 09988 09487 07217 03743 0.1189 0.0203 00014 00000 00000 00000 12
7 10000 09998 09837 08593 05634 02403 00576 00061 00002 0.0000 00000 11
8 10000 10000 09957 09404 07368 04073 01347 00210 0.0009 0.0000 0.0000 10

19 I 07547 0.4203 00829 00104 o0008 ,0 0000 00000 00000 00000 0,0000 00000 18
2 0.9335 0.7054 0.2369 0.0462 00055 00004 00000 00000 00000 00000 00000 17
3 09869 08850 04551 01332 00230 00022 00001 00000 00000 00000 00000 16
4 0.9980 0.9648 0.6733 02822 0.0696 0.0096 00006 0.0000 0.0000 0.0000 00000 15
5 09998 0.9914 0.8369 04739 01629 00318 0,0031 0.0001 00000 00000 00000 14
6 10000 09983 09324 06655 0,3081 0.0835 0.0116 00006 00000 00000 00000 13
7 1.0000 09997 09767 08180 04878 01796 00352 00028 00000 0.0000 00000 12
8 1.0000 1,0000 09933 09161 06675 03238 0,0885 00105 0.0003 00000 00000 II
9 10000 10000 09984 09674 08139 0.5000 01861 00326 00016 00000 00000 10

1-p 0.95 0.90 0.80 0.70 0.60 050 040 030 0.20 010 005

Table B (continued)
p 005 010 020 030 040 050 060 070 080 0.90 0.95

n n-s
20 1 07358 0.3917 00692 00076 00005 00000 00000 00000 00000 00000 00000 19
2 0.9245 06769 0.2061 00355 00036 00002 00000 00000 00000 00000 00000 18
3 09841 0.8670 04114 0.1071 0.0160 00013 0.0000 0.0000 00000 0.0000 0.0000 17
4 09974 09568 06296 02375 00510 00059 00003 00000 00000 00000 00000 16
5 09997 09887 08042 04164 01256 00207 00016 00000 0.0000 0.0000 0.0000 15
6 1.0000 09976 09133 0.6080 02500 0.0577 00065 0.0003 0.0000 0.0000 0.0000 14
7 10000 09996 0.9679 0.7723 0.4159 0.1316 00210 0.0013 00000 0.0000 00000 13
8 10000 09999 09900 08867 05956 02517 00565 00051 00001 00000 00000 12
9 10000 1.0000 0.9974 0.9520 07553 0.4119 01275 00171 00006 00000 00000 11

I-p 095 0.90 080 070 060 050 040 030 020 010 0.05

Supplementary Table for p = 0 5


S /I 21 22 23 24 25 26 27 28 29 30
I 0.0000 00000 0.0000 0.0000 0.0000 0.0000 0.0000 00000 00000 0.0000
2 00001 00001 0.0000 00000 00000 0.0000 00000 0.0000 0.0000 0.0000
3 00007 00004 00002 00001 00001 00000 00000 00000 00000 00000
4 00036 00022 00013 00008 00005 00003 00002 00001 00001 00000
5 00133 0.0085 00053 0.0033 00020 00012 0.0008 0.0005 0.0003 0.0002
6 00392 00262 0.0173 00113 00073 00047 00030 00019 0.0012 00007
7 00946 00669 00466 00320 00216 0.0145 00096 00063 00041 00026
8 01917 01431 01050 00758 00539 00378 00261 00178 00121 00081
9 03318 02617 02024 01537 01148 00843 00610 00436 00307 00214
10 05000 04159 03388 02706 02122 0.1635 01239 0.0925 0.0680 00494
II 06692 05841 05000 04194 03450 02786 02210 01725 01325 01002
12 08083 07383 06612 05806 0.5000 04225 03506 0.2858 0.2291 01808
13 09054 08569 07976 07294 06550 0.5775 05000 04253 0.3555 0.2923
14 09608 09331 0.8950 0.8463 0.7878 0.7214 0.6494 05747 05000 04278

Table C Binomial Confidence Limits


Each table entry is n times the lower or upper confidence bound of the binomial parameter p for an observed
number s of successes in n trials when n ≥ 5 and s/n < 0.50, at typical levels 1 − α. For s/n > 0.50, enter the table
with (n − s)/n, interchange Upper and Lower, and subtract the table entry from n. In either case, divide the result
by n to obtain the confidence limit. For s = 0, the lower limit is 0 and the upper limit is 1 − α^{1/n} for all n.
For s > 14, use the approximation indicated at the end of the table.

                  Lower Tail α                                Upper Tail α

s    s/n     0.005    0.010    0.025    0.050    0.100      0.100    0.050    0.025    0.010    0.005

000 0005 0.010 0.025 0051 0.105 3.89 4.74 557 6.64 743
005 0005 0.010 0025 0.051 0.105 3.62 4.32 497 578 634
0.10 0005 0010 0025 0051 0105 3.37 394 445 504 544
015 0005 0010 0025 0.051 0.105 3.14 360 3.99 4.42 470
0.20 0005 0010 0025 0051 0104 292 329 358 389 407

2 000 0103 0149 0242 0.355 0.532 532 630 7.22 841 927
2 0.05 0105 0.150 0245 0358 0535 511 597 677 776 847
2 010 0.106 0152 0247 0361 0538 4.90 565 634 7.17 7.74
2 015 0.107 0154 0249 0364 0.542 469 535 594 662 708
2 020 0109 0155 0252 0.368 0545 4.50 507 5.56 6.12 648
2 025 0110 0157 0.255 0371 0549 431 480 5.21 5.65 594
2 2\7 0111 0159 0.257 0374 0552 417 4.61 497 535 558
2 2\6 0.112 0.161 0.260 0377 0556 4.00 437 466 4.96 514
2 040 0114 0163 0.264 0.382 0.561 377 405 4.27 447 4.59

3 0.00 0338 0436 0619 0818 I 102 668 775 877 10.05 10 98
3 010 0348 0.448 0.634 0834 1.119 628 7.16 796 893 961
3 020 0358 0461 0650 0853 I 138 5.89 660 721 793 8.41
3 030 0370 0475 0667 0873 1.158 5.52 607 6.52 703 735
3 040 0383 0491 0687 0895 I 181 515 556 589 6.22 642
3 0.50 0398 0508 0.709 0.919 1205 479 508 529 5.49 560

4 0.00 0.672 0823 1090 1.366 1.745 799 9.15 10 24 1160 1259
4 010 0693 0.847 1.117 1.395 1773 760 858 947 10 54 II 30
4 020 0715 0872 1147 1427 1804 721 8.02 8.73 957 1013
4 030 0740 0901 I 179 1462 1.838 6.83 7.48 804 866 906
4 040 0768 0932 1216 1500 1876 646 696 738 782 809
4 050 0799 0968 1256 1543 1917 608 646 674 7.03 720

000 1078 1279 1623 1970 2433 927 10 51 II 67 13 11 1415


5 010 1.111 1315 1664 2.012 2472 888 994 1091 1208 12.90
5 0.20 I 147 1355 1.708 2057 2515 8.49 939 10 18 1111 11 74
5 030 I Ii!? 1400 1756 2.107 2563 810 884 947 10 19 10 67
5 040 1232 1449 1.810 2.162 2.615 772 831 879 932 966
5 050 1.283 1504 1.871 2224 2.673 7.33 778 813 850 872

6 0.00 1.537 1.785 2.202 2613 3152 10.53 11.84 13.06 1457 15.66
6 0.10 I 583 1835 2.255 2.667 3202 10.14 11.27 1230 13.55 1444
6 0.20 1.634 1890 2.314 2726 3.257 9.74 10.71 11.57 1259 13.28
6 0.30 1.691 1.951 2.379 2.791 3.317 935 10.16 1086 11 66 12.19
6 040 1754 2.019 2450 2.863 3.384 8.95 961 10 16 1077 11.16
6 0.50 1826 2095 2531 2944 3.458 8.54 906 947 9.90 10.17

7 000 2.037 2.330 2.814 3285 3895 11.77 Ill5 1442 1600 17 13
7 0.10 2.098 2394 2881 3.352 3.956 11 37 1257 13.67 1499 15.92
7 0.20 2164 2.464 2.954 3.424 4.022 10 97 1201 12.93 14.02 1477
7 030 2238 2542 3035 3503 4094 10.56 11.44 1220 13.08 1367
7 0.40 2.320 2628 3124 3591 4.174 10 16 10 88 11.49 1217 1261
7 0.50 2414 2726 3.225 3690 4.264 9.74 10.31 10.77 11.27 11 59

Table C (continued)
                  Lower Tail α                                Upper Tail α

s    s/n     0.005    0.010    0.025    0.050    0.100      0.100    0.050    0.025    0.010    0.005

8 000 2571 2906 3454 3.981 4656 1299 1443 1576 1740 1858
8 010 2646 2.984 3534 4059 4727 12.59 1386 1501 1640 17.37
8 020 2.727 3.069 3621 4144 4804 1218 1328 1426 1542 1622
8 030 2818 3163 3717 4238 4888 1177 1271 13 52 1446 1510
8 040 2920 3268 3824 4341 4981 II 35 1213 1279 13 53 1402
8 050 3035 3387 3944 4458 5085 10 91 1154 1206 1261 1297

9 000 3132 3507 4115 4.695 5432 14.21 1571 1708 1878 2000
9 010 3.221 3599 4208 4785 5513 13 79 1512 16.32 17.77 18.80
9 020 3318 3699 4.309 4.883 5600 1338 1454 1557 1679 1763
9 030 3426 3810 4.420 4.990 5696 1296 1395 1482 1582 1650
9 040 3547 3933 4544 5 108 5801 1253 1336 1407 1487 1540
9 050 3684 4073 4683 5242 5919 1208 12.76 13.32 13.93 14.32

10 000 3717 4130 4.795 5425 6221 1541 1696 1839 2014 2140
10 010 3820 4235 4900 5526 6311 1499 1637 1762 1913 20.20
10 020 3932 4350 5015 5636 6409 1456 1578 1686 1814 1902
10 030 4057 4477 5.141 5756 6515 1413 1518 1610 1716 17.88
10 040 4197 4619 5281 5890 6632 1369 1458 1533 16.19 16.76
10 050 4355 4779 5439 6039 6.763 1324 1396 14.56 15.22 1565

II 000 4321 4771 5.491 6169 7021 1660 1821 1968 2149 2278
11 010 4.438 489Q 5608 6281 7120 1617 1761 1891 20.47 2157
II 020 4566 5019 5736 6402 7227 15.74 1701 18.14 19.47 20.39
II 030 4707 5162 5877 6535 7343 1530 1640 1736 18.48 19.23
11 040 4866 5.322 6033 6683 7472 1485 1579 1658 1749 1809
II 050 5045 5502 6209 6848 7616 1438 1515 1579 1650 16.96

12 000 4943 5428 6201 6924 7829 1778 19.44 2096 2282 24.14
12 010 5073 5560 6330 7.047 7937 1735 1884 2018 2180 2293
12 020 5216 5.703 6470 7179 8053 1691 1823 19.40 2079 2175
12 030 5374 5862 6625 7325 8180 1646 1761 1861 1978 20.57
12 040 5551 6039 6797 7486 8320 1600 1698 1782 1877 1941
12 050 5.751 6239 6.990 7666 8.476 15.52 16.33 17.01 1776 1825

13 0.00 5580 6.099 6.922 7.690 8646 18.96 2067 2223 24.14 25.50
13 010 5724 6.243 7.063 7.822 8762 1852 2006 21.44 23 II 24.28
13 020 5882 6401 7.216 7966 8887 1807 1944 20.65 22.09 2308
13 030 6056 6575 7.384 8124 9024 17.62 18.81 19.85 21.07 21.89
13 040 6250 6769 7.571 8298 9174 1715 18.17 19.04 2005 20.71
13 050 6471 6.988 7.781 8.493 9.342 16.66 17.51 18.22 19.01 19.53

14 000 6231 6.782 7.654 8.464 9.470 20.13 2189 2349 25.45 2684
14 010 6388 6.939 7.806 8.606 9.594 19.68 2127 22.69 2441 25.61
14 020 6560 7.111 7.972 8.761 9728 19.23 20.64 21.89 23.38 24.40
14 030 6750 7300 8.153 8.930 9874 1877 2000 21.08 2234 2320
14 040 6962 7.510 8355 9117 10.035 18.29 1935 2026 2131 22.00
14 0.50 7.203 7.748 8.581 9.327 10.214 17.79 18.67 19.42 20.25 20.80

Other values can be approximated as follows. For lower limits, let a = n − s + 1, b = s, z = positive normal
quantile (Table A); for upper limits, let a = n − s, b = s + 1, z = negative normal quantile. Calculate A = 9a − 1,
B = 9b − 1, C = 3z, D = B² − bC², E = AB + C(aD + bA²)^{1/2}. The confidence limit is 1/[1 + (b/a)²(E/D)³].
For α ≥ 0.005, the error is less than 1% if s, n − s ≥ 9 and less than 0.5% if s, n − s ≥ 12. See Pratt [1968].
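The following is a minimal sketch, in Python, of the approximation as reconstructed in the note above; it is our own illustration (the function name and interface are assumptions, not part of the original tables), with z the normal quantile from Table A.

    from math import sqrt

    def approx_binomial_limit(s, n, z, lower=True):
        # Approximate one-sided confidence limit for a binomial proportion,
        # given s successes in n trials, following the note above (Pratt [1968]).
        # For a lower limit supply z > 0 (e.g. 1.96 for alpha = 0.025);
        # for an upper limit supply z < 0 (e.g. -1.96).
        a, b = (n - s + 1, s) if lower else (n - s, s + 1)
        A, B, C = 9 * a - 1, 9 * b - 1, 3 * z
        D = B * B - b * C * C
        E = A * B + C * sqrt(a * D + b * A * A)
        return 1.0 / (1.0 + (b / a) ** 2 * (E / D) ** 3)

    # Example: s = 12 successes in n = 24 trials, alpha = 0.025 in each tail,
    # gives roughly (0.291, 0.709); multiplying by n reproduces the 0.025
    # entries of the s = 12, s/n = 0.5 row above.
    print(approx_binomial_limit(12, 24, 1.96, lower=True),
          approx_binomial_limit(12, 24, -1.96, lower=False))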

Table D Cumulative Probabilities for Wilcoxon Signed Rank Statistic


Each table entry is the cumulative probability of t or less (left-tail) under the null distribution of T+ (or equivalently T−).

t\n      5       6       7       8       9       10          t\n      8       9       10

0 00313 0.0156 00078 00039 00020 00010 14 0.3203 01797 00967


I 0.0625 0.0313 00156 00078 00039 00020 15 0.3711 0.2129 01162
2 00938 0.0469 00234 0.0117 0.0059 00029 16 0.4219 02480 01377
3 0.1563 0.0781 00391 0.0195 0.0098 0.0049 17 04727 02852 o 1611
4 02188 01094 00547 00273 00137 00068 18 05273 0.3262 01875
5 0.3125 0.1563 0.0781 0.0391 0.0195 00098 19 03672 02158
6 04063 0.2188 0.1094 0.0547 0.0273 0.0137 20 04102 02461
7 05000 0.2813 01484 0.0742 00371 00186 21 04551 02783
8 0.3438 01875 00977 0.0488 00244 22 05000 03125
9 04219 02344 01250 0.0645 00322 23 03477
10 0.5000 02891 01563 00820 00420 24 03848
11 03438 01914 01016 00527 25 04229
12 04063 0.2305 0.1250 00654 26 0.4609
13 04688 02734 01504 00801 27 05000

t\n     11      12      13      14      15      16      17      18      19      20

0 00005 00002 0.0001 0.0001 0.0000 0.0000 00000 00000 00000 00000
I 00010 00005 00002 00001 00001 0.0000 0.0000 0.0000 00000 0.0000
2 00015 00007 00004 00002 00001 00000 0.0000 0.0000 0.0000 00000
3 0.0024 00012 00006 00003 00002 00001 00000 00000 00000 00000
4 00034 0.0017 00009 00004 00002 00001 00001 00000 00000 00000
5 00049 0.0024 00012 00006 0.0003 00002 0.0001 0.0000 0.0000 00000
6 0.0068 0.0034 00017 00009 00004 00002 00001 00001 00000 00000
7 00093 00046 00023 00012 00006 00003 00001 00001 00000 00000
8 00122 00061 00031 00015 0.0008 0.0004 0.0002 00001 00000 00000
9 0.0161 0.0081 00040 00020 0.0010 00005 00003 00001 00001 00000
10 00210 00105 00052 00026 00013 00007 00003 00002 00001 00000
11 00269 00134 00067 00034 00017 00008 00004 00002 00001 00001
12 00337 00171 0.0085 0.0043 0.0021 0.0011 0.0005 0.0003 00001 00001
13 0.0415 00212 00107 00054 00027 00013 00007 00003 00002 00001
14 00508 00261 0.0133 00067 00034 00017 00008 00004 00002 00001
15 00615 00320 0.0164 0.0083 00042 0.0021 0.0010 0.0005 00003 00001
16 0.0737 00386 00199 00101 00051 00026 00013 0.0006 00003 0.0002
17 00874 00461 00239 00123 00062 0.0031 00016 00008 0.0004 0.0002
18 0.1030 0.0549 00287 00148 00075 00038 00019 00010 00005 00002
19 01201 00647 00341 00176 00090 00046 00023 00012 0.0006 00003
20 01392 00757 0.0402 00209 00108 00055 0.0028 0.0014 0.0007 00004
21 0.1602 00881 00471 00247 00128 00065 0.0033 00017 00008 00004
22 01826 0.1018 00549 00290 0.0151 0.0078 00040 0.0020 00010 0.0005
23 02065 01167 0.0636 0.0338 0.0177 00091 00047 0.0024 00012 00006
24 0.2324 01331 0.0732 0.0392 00206 00107 00055 00028 0.0014 00007
25 0.2598 01506 00839 00453 00240 00125 00064 0.0033 00017 00008
26 0.2886 0.1697 0.0955 0.0520 00277 0.0145 00075 0.0038 0.0020 0.0010
27 0.3188 01902 01082 0.0594 0.0319 00168 0.0087 0.0045 0.0023 0.0012
28 03501 02119 01219 0.0676 00365 00193 00101 0.0052 0.0027 00014
29 0.3823 0.2349 0.1367 0.0765 0.0416 0.0222 0.0116 0.0060 0.0031 0.0016
30 0.4155 02593 0.1527 0.0863 0.0473 00253 00133 0.0069 00036 00018
31 0.4492 02847 0.1698 0.0969 0.0535 0.0288 0.0153 00080 0.0041 00021
32 04829 0.3110 01879 01083 00603 00327 0.0174 0.0091 00047 00024
33 05171 0.3386 0.2072 0.1206 00677 00370 00198 00104 00054 00028
34 0.3667 02274 0.1338 00757 00416 00224 0.0118 00062 00032
35 0.3955 02487 0.1479 00844 0.0467 0.0253 0.0134 0.0070 0.0036
36 04250 02709 0.1629 00938 00523 00284 00152 00080 00042
37 0.4548 0.2939 0.1788 0.1039 0.0583 00319 0.0171 00090 00047

Table D (continued)
t\n     11      12      13      14      15      16      17      18      19      20

38 0.4849 0.3177 01955 0.1147 00649 0.Q357 00192 00102 00053


39 0.5151 0.3424 0.2131 0.1262 00719 0.0398 00216 0.0115 0.0060
40 0.3677 02316 01384 00795 00443 0.0241 00129 00068
41 0.3934 0.2508 0.1514 0.0877 00492 0.0269 0.0145 0.0077
42 0.4197 0.2708 0.1651 00964 0.0544 00300 00162 00086
43 04463 02915 0.1796 0.1057 0.0601 0.0333 00180 0.0096
44 04730 03129 0.1947 01156 00662 00368 00201 0.0107
45 0.5000 03349 02106 0.1261 00727 00407 0.0223 00120
46 0.3574 02271 0.1372 00797 00449 00247 0.0133
47 0.3804 02444 0.1489 00871 0.0494 0.0273 0.0148
48 04039 02622 01613 0.0950 0.0542 00301 0.0164
49 04276 02807 0.1742 01034 00594 0.0331 00181
50 04516 02997 01877 0.1123 00649 00364 00200
51 04758 03193 0.2019 01218 00708 00399 0.0220
52 05000 03394 02166 01317 00770 00437 00242

t\n     15      16      17      18      19      20          t\n     18      19      20

53 0.3599 02319 0.1421 0.0837 0.0478 0.0266 79 0.3994 0.2706 0.1744


54 03808 02477 01530 0.0907 00521 0.0291 80 04159 02839 01841
55 0.4020 02641 0.1645 0.0982 00567 0.0319 81 0.4325 02974 0.1942
56 04235 02809 0.1764 01061 0.0616 00348 82 04493 0.3113 02045
57 04452 02983 0.1889 01144 00668 00379 83 04661 03254 02152
58 0.4670 03161 02019 0.1231 00723 00413 84 04831 03397 02262
59 04890 0.3343 0.2153 0.1323 00782 0.0448 85 0.5000 03543 0.2375
60 05110 03529 02293 01419 00844 00487 86 03690 02490
61 0.3718 0.2437 0.1519 00909 0.0527 87 0.3840 02608
62 03910 02585 01624 00978 00570 88 03991 02729
63 0.4104 02738 01733 01051 00615 89 04144 02853
64 0.4301 0.2895 0.1846 0.1127 0.0664 90 04298 0.2979
65 04500 0.3056 01964 01206 00715 91 04453 03108
66 04699 0.3221 02086 01290 00768 92 0.4609 03238
67 04900 03389 02211 01377 0.0825 93 04765 03371
68 05100 03559 02341 01467 00884 94 04922 0.3506
69 03733 02475 01562 0.0947 95 05078 0.3643
70 03910 02613 01660 01012 96 03781
71 04088 0.2754 0.1762 0.1081 97 03921
72 0.4268 02899 01868 01153 98 0.4062
73 04450 03047 0.1977 0.1227 99 0.4204
74 0.4633 0.3198 0.2090 0.1305 100 04347
75 04816 03353 0.2207 0.1387 101 04492
76 05000 03509 0.2327 01471 102 04636
77 03669 0.2450 0.1559 103 04782
78 0.3830 0.2576 0.1650 104 04927
For larger n, Table A can be used with the approximate standard normal deviate z = (t − μ)/σ, where μ = n(n + 1)/4
and σ = [n(n + 1)(2n + 1)/24]^{1/2}. The table below gives μ and σ for 21 ≤ n ≤ 40.

n    21      22      23      24      25      26      27      28      29      30
μ    115.5   126.5   138.0   150.0   162.5   175.5   189.0   203.0   217.5   232.5
σ    28.77   30.80   32.88   35.00   37.17   39.37   41.62   43.91   46.25   48.62

n    31      32      33      34      35      36      37      38      39      40
μ    248.0   264.0   280.5   297.5   315.0   333.0   351.5   370.5   390.0   410.0
σ    51.03   53.48   55.97   58.49   61.05   63.65   66.29   68.95   71.66   74.40

Source: Adapted from Table II of H. L. Harter and D. B. Owen, Eds. (1972), Selected Tables in Mathematical Statistics, Vol. I,
Markham Publishing Co., Chicago, with permission of the Institute of Mathematical Statistics.

Table E Cumulative Probabilities for Hypergeometric Distribution


Each table entry is the cumulative probability of a or less (left-tail) under the null distribution, where t = N/2 if N is
even and t = (N ± 1)/2 if N is odd, and N = m + n. The entries apply to a 2 × 2 table (like that shown in Table 3.1 of
Chap. 5) arranged so that 2a < m ≤ t ≤ n as a result of interchanging rows, columns, rows with columns, or lower and
upper tails as necessary. For m = 1, use P(A = 0) = (n − t + 1)/(n + 1) for any n. Other entries not included have P =
0.0000 or P > 0.5. Right-tail cumulative probabilities are found from this table using P(A ≥ a) = 1 − P(A ≤ a − 1).

n − m          0      1      1      2      3      3      4      5      5      6      7      7
m   a   t − m  0      0      1      1      1      2      2      2      3      3      3      4

2 0 01667 03000 01000 02000 0.2857 01429 02143 0.2778 01667 02222 02727 01818
I 08333 09000 07000 08000 0.8571 07143 07857 08333 07222 0.7778 08182 07273
3 0 0.0500 01143 00286 00714 01190 00476 00833 01212 0.0606 00909 01224 00699
1 0.5000 0.6286 03714 05000 05952 0.4048 05000 05758 04242 05000 05629 04371
4 0 00143 00397 00079 00238 0.0455 00152 0.0303 0.0490 0.0210 00350 00513 00256
1 02429 03571 01667 02619 0.3485 0.1970 0.2727 03427 02168 02797 03385 02308
2 07571 08333 06429 07381 08030 06515 07273 07832 06573 07203 0.7692 06615
5 0 00040 0.0130 00022 00076 00163 00047 00105 00186 00070 00128 00204 00091
1 01032 0.1753 00671 01212 01795 00862 01329 01818 01002 01410 01833 01109
2 05000 06082 03918 05000 05874 04126 05000 05734 04266 05000 05633 04367
6 0 00011 00041 00006 00023 00056 0.0014 0.0035 00068 00023 00045 00077 00031
1 00400 0.0775 00251 00513 00839 00350 00594 00882 00430 00656 00913 00495
2 02836 0.3835 02086 02960 03776 02308 03042 03733 02466 03100 03700 02585
3 07165 0.7914 06166 07040 07692 06224 06958 07534 06267 06900 07415 06300
7 0 00003 00012 00002 00007 00019 00004 00011 0.0024 00007 00015 00028 00010
I 00146 00317 00089 00203 00364 00134 00249 00399 00174 00286 00426 00209
2 01431 02145 01002 01573 02178 o 1170 01674 02199 01299 01749 02214 01401
3 05000 05952 04048 05000 05806 04194 05000 05700 04300 05000 05619 04381
8 0 00001 00004 00000 00002 00006 00001 00004 00008 00002 0.0005 00010 00003
1 00051 00122 00030 00076 0.0149 00049 00099 00170 00067 00119 00188 00084
2 00660 01090 00445 0.0767 01149 00549 00849 01192 00635 00913 0.1224 0.0706
3 03096 0.3992 02380 03186 03950 02549 0.3250 03916 02678 03297 03889 02779
4 06904 07620 06008 06814 07451 06050 06750 07322 06084 06703 07221 06111
9 0 00000 00001 00000 00001 00002 00000 00001 00003 00001 00002 00004 00001
I 00017 00045 00010 00027 00058 00017 00038 00069 00025 00047 00079 00033
2 0.0283 00513 00185 00349 00563 00242 00402 00602 00291 00447 00633 00335
3 01735 02422 01276 01849 02449 01421 01935 02468 01535 02002 02481 01628
4 05000 05859 04141 0.5000 05750 04250 05000 05666 04334 05000 05600 04400
10 0 00000 0.0000 0.0000 0.0000 0.0001 00000 00000 00001 00000 00001 00001 00000
I 00005 0.0016 00003 00010 00022 00006 00014 0.0027 00009 0.0018 0.0032 0.0012
2 0.0115 0.0226 0.0073 00150 00260 00101 0.0180 0.0287 0.0127 0.0207 0.0310 00151
3 00894 01349 00635 0.0992 01402 00736 01069 0.1442 0.0820 0.1131 01473 00891
4 0.3281 04100 02599 0.3350 0.4067 02735 0.3401 04041 02841 03441 04018 02928
5 0.6719 07401 05900 0.6650 07265 0.5933 0.6599 07159 05959 06559 07072 05982
II I 0.0002 0.0005 00001 00003 00008 0.0002 00005 0.0010 0.0003 0.0007 00013 00004
2 0.0045 0.0095 0.0028 00061 00114 00040 0.0077 0.0130 0.0053 0.0092 00144 00065
3 00431 0.0699 00296 0.0498 0.0749 0.0358 0.0554 0.0789 00412 00601 00821 0.0460
4 0.1974 0.2632 01504 0.2068 0.2655 01628 02142 0.2671 01730 02200 02683 01814
5 05000 05789 04211 05000 0.5704 0.4296 05000 05635 04365 05000 05579 04421
12 1 00001 00002 00000 0.0001 0.0003 0.0001 00002 00004 00001 0.0002 00005 0.0002
2 00017 00038 0.0010 0.0024 00048 00016 00032 0.0056 00021 0.0039 00064 00027
3 00196 0.0341 00131 0.0236 0.0377 0.0165 0.0271 00407 00197 00302 00433 00226
4 01102 01566 00812 01189 0.1612 00906 01259 01649 0.0987 0.1318 01678 01056
5 0.3421 04179 0.2772 0.3475 04153 0.2883 0.3518 0.4131 0.2973 03552 04112 03047
6 06579 0.7228 05821 06525 07117 0.8388 0.6482 07027 05869 06448 0.6953 05888
13 1 00000 00001 00000 00000 00001 00000 00001 00001 0.0000 0.0001 0.0002 00001
2 00006 0.0015 0.0004 0.0009 0.0019 0.0006 0.0013 0.0024 00008 00016 00028 00011
3 0.0085 00157 00056 00107 00180 00073 00127 00200 00090 00145 00218 00106
4 00576 00871 00412 00642 00919 0.0476 00697 00957 00531 00744 00990 00581

Table E (continued)
n − m          0      1      1      2      3      3      4      5      5      6      7      7
m   a   t − m  0      0      1      1      1      2      2      2      3      3      3      4

5 02169 0.2798 0.1697 02247 0.2817 01804 0.2311 02831 01814 02363 02842 0.1970
6 05000 05734 04266 05000 0.5664 0.4336 0.5000 05607 0.4383 05000 05559 04441
14 I 00000 00000 0.0000 00000 0.0000 0.0000 0.0000 00000 00000 00000 0.0001 00000
2 00002 00006 00001 00003 0.0008 0.0002 0.0005 0.0010 0.0003 00006 00012 00004
3 0.0035 00070 0.0023 0.0046 00082 0.0031 0.0057 00094 00039 00067 00105 00048
4 00285 0.0457 00199 0.0328 0.0495 00237 0.0366 0.0526 00272 00399 00554 00304
5 01284 0.1749 00974 01362 0.1790 0.1061 0.1426 01823 0.1137 0.1480 0.1851 0.1202
6 03532 0.4241 02912 0.3576 04219 03005 0.3612 04201 03082 03641 04185 0.3147
7 06468 0.7088 0.5758 0.6424 06995 0.5781 0.6388 06918 0.5799 06359 0.6853 05815
15 2 00001 00002 00009 00001 00003 00001 0.0002 00004 0.0001 00002 0.0005 00002
3 00014 00030 0.0092 00019 00036 00013 00024 00043 0.0017 0.0030 0.0049 0.0021
4 00134 00228 0.0528 0.0160 00253 00113 0.0183 0.0276 0.0133 00205 0.0296 0.0153
5 00716 01028 01862 00778 0.1072 0.0591 0.0832 01109 0.0646 00878 0.1141 0.0696
6 0.2330 0.2933 0.4311 02397 0.2949 0.1956 0.2453 0.2962 0.2036 0.2499 0.2972 0.2105
7 0.5000 05689 07067 0.5000 0.5631 04369 0.5000 05582 0.4418 05000 0.5541 0.4459
For larger sample sizes, use

    P(A = 0) = C(N − t, m)/C(N, m)

and

    P(A ≤ a) = [1 + tm/(n − t + 1) + ··· + t(t − 1)···(t − a + 1) · m(m − 1)···(m − a + 1) / ((n − t + 1)(n − t + 2)···(n − t + a) · a!)] · C(N − t, m)/C(N, m),

where C(x, y) denotes the binomial coefficient x!/[y!(x − y)!].
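The series above can be evaluated term by term without computing large factorials; the sketch below is our own Python illustration (names and interface assumed), using the notation of the note.

    from math import comb

    def hypergeometric_left_tail(a, m, n, t):
        # P(A <= a) under the null distribution of the 2 x 2 table with
        # margins m, n (N = m + n) and t, N - t, computed as P(A = 0) times
        # the bracketed series of ratios P(A = j)/P(A = 0).
        N = m + n
        p0 = comb(N - t, m) / comb(N, m)     # P(A = 0)
        term, total = 1.0, 1.0
        for j in range(1, a + 1):
            term *= (t - j + 1) * (m - j + 1) / ((n - t + j) * j)
            total += term
        return p0 * total

    # Example: m = 5, n = 9, t = 7, a = 2 gives 0.5000, agreeing with the
    # m = 5, a = 2 entry of Table E in the column n - m = 4, t - m = 2.
    print(hypergeometric_left_tail(2, 5, 9, 7))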
Alternatively, approximate cumulative probabilities can be found from Table A with the approximate standard normal deviate

    z = [|a'd' − b'c'| / (a''d'' − b''c'')] [2L(1 − 1/(6N)) / ((1 + 1/(6m))(1 + 1/(6n))(1 + 1/(6t))(1 + 1/(6(N − t))))]^{1/2}
where a', b', c', d' are the cell entries a, b, c, d in the 2 × 2 table corrected by 1/2 respectively, and

    a'' = a' + 1/6 + 0.01/(a' + 0.5) + 0.01/(m + 1) + 0.01/(t + 1),
and similarly for b'', c'', and d'', with m and t replaced by the row and column total for the entry in question, and

    L = a' ln[a'N/(tm)] + b' ln[b'N/(tn)] + c' ln[c'N/(m(N − t))] + d' ln[d'N/(n(N − t))]

in terms of the natural logarithm ln. If common logarithms are used, replace 2L by 4.60517L. This normal approxima-
tion has at least the accuracy specified below if min(a, b − 1, c − 1, d) is at least the value shown. See Ling and Pratt
[1980] for the minimum guaranteed accuracy for other values of min(a, b − 1, c − 1, d) and other tail probabilities.
min(a, b − 1, c − 1, d)     3         4         6         8         12        24         50
Any tail probability        0.00050   0.00033   0.00017   0.00010   0.000050  0.000014   0.0000035
Tail probability ≤ 0.05     0.00025   0.00016   0.000094  0.000053  0.000028  0.0000087  0.0000027
Tail probability ≤ 0.01     0.00013   0.000084  0.000045  0.000029  0.000015  0.0000051  0.0000016

Source: Adapted from G. J. Lieberman and D. B. Owen (1961), Tables of the Hypergeometric Probability Distribution, Stanford
University Press, Stanford, California, with permission.

Table F Cumulative Probabilities for Wilcoxon Rank Sum Statistic


Each table entry labeled P is the left-tail cumulative probability of Rx under the null
distribution for 2 ≤ m ≤ n ≤ 10. For m = 1, use P(Rx = i) = 1/(n + 1) for i = 1, ..., n + 1.
Right-tail cumulative probabilities are found from P(Rx ≥ t) = P[Rx ≤ m(N + 1) − t].

Rx    P        Rx    P        Rx    P        Rx    P

m = 2, n=2 m = 2, n =7 8 0.182 12 0.274


9 0.242 13 0.357
3 0.028
3 0.167 10 0.303 14 0.452
4 0.056 0.548
4 0.333 11 0.379 15
5 0.111
5 0.667 12 0.455
6 0.167
13 0.0545
m = 3, n =7
m = 2, n=3 7 0.250
6 0.008
8 0.333 m = 3, n =3 7 0.017
3 0.100
9 0.444
4 0.200 6 0.050 8 0.033
10 0.556
5 0.400 7 0.100 9 0.058
6 0.600 m = 2, n =8 8 0.200 10 0.092
9 0.350 11 0.133
m = 2, n=4 3 0.022 10 0.500 12 0.192
4 0.044 13 0.258
3 0.067
0.133
5 0.089 m = 3, n=4 14 0.333
4
6 0.133 15 0.417
5 0.267 6 0.029
7 0.200 16 0.500
6 0.400 7 0.057
8 0.267
7 0.600 8 0.114
9 0.356
9 0.200
m= 3, n=8
m= 2, n=5 10 0.444
10 0.314 6 0.006
11 0.556
3 0.048 11 0.429 7 0.012
4 0.095 12 0.571 8 0.024
5 0.190
m = 2, n=9
9 0.042
6 0.286 3 0.018
m = 3, n =5 10 0.067
7 0.429 4 0.036 6 0.018 11 0.097
8 0.571 5 0.073 7 0.036 12 0.139
6 0.109 8 0.071 13 0.188
m= 2, n= 6 7 0.164 9 0.125 14 0.248
3 0.036 8 0.218 10 0.196 15 0.315
4 0.071 9 0.291 11 0.286 16 0.388
5 0.143 10 0.364 12 0.393 17 0.461
6 0.214 11 0.455 13 0.500 18 0.539
7 0.321 12 0.545
m= 3, n=6 m= 3, n=9
8 0.429
0.571
m = 2, n = 10 6 0.012 6 0.005
9
3 0.015 7 0.024 7 0.009
4 0.030 8 0.048 8 0.018
5 0.061 9 0.083 9 0.032
6 0.091 10 0.131 10 0.050
7 0.136 11 0.190 11 0.073

Table F (continued)

Rx    P        Rx    P        Rx    P        Rx    P

12 0.105 15 0.143 14 0.024 19 0.071


13 0.141 16 0.206 15 0.036 20 0.094
14 0.186 17 0.278 16 0.055 21 0.120
15 0.241 18 0.365 17 0.077 22 0.152
16 0.300 19 0.452 18 0.107 23 0.187
17 0.364 20 0.548 19 0.141 24 0.227
18 0.432 20 0.184 25 0.270
19 0.500 m= 4, n=6 26
21 0.230 0.318
10 0.005 22 0.285 27 0.367
m = 3, n = 10
0.341 28 0.420
11 0.010 23
6 0.003 12 0.019 24 0.404 29 0.473
7 0.007 13 0.033 25 0.467 30 0.527
8 0.014 14 0.057 26 0.533
9 0.024 15 0.086 m = 5, n =5
10 0.038 16 0.129 m = 4, n=9
0.004
15
11 0.056 17 0.176 10 0.001 16 0.008
12 0.080 18 0.238 11 0.003 17 0.016
13 0.108 19 0.305 12 0.006 18 0.028
14 0.143 20 0.381 13 0.010 19 0.048
15 0.185 21 0.457 14 0.017 20 0.075
16 0.234 22 0.543 15 0.025 21 0.111
17 0.287 16 0.038 22 0.155
18 0.346 m = 4, n=7 17 0.053 23 0.210
19 0.406 18 0.074 24 0.274
10 0.003
20 0.469 19 0.099 25 0.345
11 0.006
21 0.531 20 0.130 26 0.421
12 0.012
21 0.165 27 0.500
m = 4, n=4 13 0.021
14 0.036 22 0.207
23
m= 5, n=6
10 0.014 15 0.055 0.252
II 0.029 16 0.082 24 0.302 15 0.002
12 0.057 17 0.115 25 0.355 16 0.004
13 0.100 18 0.158 26 0.413 17 0.009
14 0.171 19 0.206 27 0.470 18 0.015
15 0.243 20 0.264 28 0.530 19 0.026
16 0.343 21 0.324 20 0.041
17 0.443
m= 4, n = 10 0.063
22 0.394 21
18 0.557 23 0.464 10 0.001 22 0.089
24 0.536 11 0.002 23 0.123
m= 4, n= 5
12 0.004 24 0.165
10 0.008 m= 4, n =8 13 0.007 25 0.214
11 0.016 10 0.002 14 0.012 26 0.268
12 0.032 11 0.004 15 0.018 27 0.331
13 0.056 12 0.008 16 0.027 28 0.396
14 0.095 13 0.014 17 0.038 29 0.465
18 0.053 30 0.535

Table F (continued)

Rx    P        Rx    P        Rx    P        Rx    P

m = 5, n= 7 m = 5, n=9 33 0.220 35 0.183


34 0.257 36 0.223
15 0.001 15 0.000
35 0.297 37 0.267
16 0.003 16 0.001
36 0.339 38 0.314
17 0.005 17 0.002
37 0.384 39 0.365
18 0.009 18 0.003
38 0.430 40 0.418
19 0.015 19 0.006
39 0.477 41 0.473
20 0.024 20 0.009
40 0.523 42 0.527
21 0.037 21 0.014
22 0.053 22 0.021 m = 6, n =6 m = 6, n = 8
23 0.074 23 0.030
24 0.041 21 0.001 21 0.000
24 0.101
25 0.056 22 0.002 22 0.001
25 0.134
26 0.073 23 0.004 23 0.001
26 0.172
27 0.095 24 0.008 24 0.002
27 0.216
28 0.120 25 0.013 25 0.004
28 0.265
29 0.149 26 0.021 26 0.006
29 0.319
30 0.182 27 0.032 27 0.010
30 0.378
31 0.219 28 0.047 28 O.oJ5
31 0.438
32 0.259 29 0.066 29 0.021
32 0.500
33 0.303 30 0.090 30 0.030
m = 5, n=8 34 0.350 31 0.120 31 0.041
35 0.399 32 0.155 32 0.054
15 0.001 33 0.197
36 0.449 33 0.071
16 0.002 34 0.242 0.091
37 0.500 34
17 0.003 35 0.294 35 0.114
18 0.005 m = 5, n = 10 36 0.350 0.141
36
19 0.009 37 0.409 0.172
15 0.000 37
20 0.015 38 0.469 38 0.207
21 0.023 16 0.001
39 0.531 39 0.245
22 0.033 17 0.001
0.002 40 0.286
23 0.047 18 m = 6, n =7 0.331
19 0.004 41
24 0.064
0.006 21 0.001 42 0.377
25 0.085 20
22 0.001 43 0.426
26 0.111 21 0.010
23 0.002 44 0.475
27 0.142 22 0.014
23 0.020 24 0.004 45 0.525
28 0.177
24 0.028 25 0.007
29 0.218 m = 6, n=9
25 0.038 26 0.011
30 0.262
26 0.050 27 0.017 21 0.000
31 0.311
27 0.065 28 0.026 22 0.000
32 0.362
28 0.082 29 0.037 23 0.001
33 0.416
29 0.103 30 0.051 24 0.001
34 0.472
30 0.127 31 0.069 25 0.002
35 0.528
31 0.155 32 0.090 26 0.004
32 0.185 33 0.117 27 0.006
34 0.147 28 0.009

Table F (continued)

Rx    P        Rx    P        Rx    P        Rx    P

29 0.013 44 0.246 36 0.010 50 0.176


30 0.018 45 0.281 37 0.014 51 0.204
31 0.025 46 0.318 38 0.020 52 0.235
32 0.033 47 0.356 39 0.027 53 0.268
33 0.044 48 0.396 40 0.036 54 0.303
34 0.057 49 0.437 41 0.047 55 0.340
35 0.072 50 0.479 42 0.060 56 0.379
36 0.091 51 0.521 43 0.076 57 0.419
37 0.112 44 0.095 58 0.459
m = 7, n = 7
38 0.136 45 0.116 59 0.500
39 0.164 28 0.000 46 0.140
m = 7, n = 10
40 0.194 29 0.001 47 0.168
41 0.228 30 0.001 48 0.198 28 0.000
42 0.264 31 0.002 49 0.232 29 0.000
43 0.303 32 0.003 50 0.268 30 0.000
44 0.344 33 0.006 51 0.306 31 0.000
45 0.388 34 0.009 52 0.347 32 0.001
46 0.432 35 0.013 53 0.389 33 0.001
47 0.477 36 0.019 54 0.433 34 0.002
48 0.523 37 0.027 55 0.478 35 0.002
38 0.036 56 0.522 36 0.003
m = 6, 11 = 10 39 0.049 37 0.005
0.064 m = 7, 11=9
38 0.007
21 0.000 40
22 0.000 41 0.082 28 0.000 39 0.009
23 0.000 42 0.104 29 0.000 40 0.012
24 0.001 43 0.130 30 0.000 41 0.017
25 0.001 44 0.159 31 0.001 42 0.022
26 0.002 45 0.191 32 0.001 43 0.028
27 0.004 46 0.228 33 0.002 44 0.035
28 0.005 47 0.267 34 0.003 45 0.044
29 0.008 48 0.310 35 0.004 46 0.054
30 0.011 49 0.355 36 0.006 47 0.067
31 0.016 50 0.402 37 0.008 48 0.081
32 0.021 51 0.451 38 0.011 49 0.097
33 0.028 52 0.500 39 0.016 50 0.115
34 0.036 40 0.021 51 0.135
m = 7, n = 8 52 0.157
35 0.047 41 0.027
36 0.059 28 0.000 42 0.036 53 0.182
37 0.074 29 0.000 43 0.045 54 0.209
38 0.090 30 0.001 44 0.057 55 0.237
39 0.110 31 0.001 45 0.071 56 0.268
40 0.132 32 0.002 46 0.087 57 0.300
41 0.157 33 0.003 47 0.105 58 0.335
42 0.184 34 0.005 48 0.17 59 0.370
43 0.214 35 0.007 49 0.' J 60 0.406
61 0.443
62 0.481
63 0.519

Table F (continued)

Rx    P        Rx    P        Rx    P        Rx    P

m = 8, n=8 44 0.003 50 0.010 61 0.016


45 0.004 51 0.013 62 0.020
36 0.000
46 0.006 52 0.017 63 0.025
37 0.000
47 0.008 53 0.022 64 0.031
38 0.000
48 0.010 54 0.027 65 0.039
39 0.001
49 0.014 55 0.034 66 0.047
40 0.001
50 0.018 56 0.042 67 0.057
41 0.001
51 0.023 57 0.051 68 0.068
42 0.002 0.061 0.081
52 0.030 58 69
43 0.003 0.095
53 0.037 59 0.073 70
44 0.005
54 0.046 60 0.086 71 0.111
45 0.007
55 0.057 61 0.102 72 0.129
46 0.010 0.149
56 0.069 62 0.118 73
47 0.014
57 0.084 63 0.137 74 0.170
48 0.019
58 0.100 64 0.158 75 0.193
49 0.025
59 0.118 65 0.180 76 0.218
50 0.032
60 0.138 66 0.204 77 0.245
51 0.041
61 0.161 67 0.230 78 0.273
52 0.052
62 0.185 68 0.257 79 0.302
53 0.065
63 0.212 69 0.286 80 0.333
54 0.080
64 0.240 70 0.317 81 0.365
55 0.097
65 0.271 71 0.348 82 0.398
56 0.117
66 0.303 72 0.381 83 0.432
57 0.139
67 0.336 73 0.414 84 0.466
58 0.164
68 0.371 74 0.448 85 0.500
59 0.191
69 0.407 75 0.483
60 0.221 m = 9, n = 10
70 0.444 76 0.517
61 0.253
71 0.481 45 0.000
62 0.287 m = 9, n=9
72 0.519 46 0.000
63 0.323
45 0.000 47 0.000
64 0.360 m = 8, n = 10
46 0.000 48 0.000
65 0.399
36 0.000 47 0.000 49 0.000
66 0.439
37 0.000 48 0.000 50 0.000
67 0.480
38 0.000 49 0.000 51 0.000
68 0.520
39 0.000 50 0.000 52 0.000
40 0.000 51 0.001 53 0.001
m = 8, n=9
41 0.000 52 0.001 54 0.001
36 0.000
42 0.001 53 0.001 55 0.001
37 0.000
43 0.001 54 0.002 56 0.002
38 0.000
44 0.002 55 0.003 57 0.003
39 0.000
45 0.002 56 0.004 58 0.004
40 0.000
46 0.003 57 0.005 59 0.005
41 0.001
47 0.004 58 0.007 60 0.007
42 0.001
48 0.006 59 0.009 61 0.009
43 0.002
49 0.008 60 0.012 62 0.011

Table F (continued)

Rx    P        Rx    P        Rx    P

63 0.014 m = 10, n = 10 81 0.038


64 0.017 82 0.045
55 0.000
65 0.022 83 0.053
56 0.000
66 0.027 84 0.062
57 0.000
67 0.033
58 0.000
85 0.072
68 0.039 86 0.083
59 0.000
69 0.047 87 0.095
60 0.000
70 0.056 88 0.109
61 0.000
71 0.067 89 0.124
62 0.000
72 0.078
63 0.000
90 0.140
73 0.091 91 0.157
64 0.001
74 0.106 92 0.176
65 0.001
75 0.121 93 0.197
66 0.001 0.218
76 0.139 94
67 0.001
77 0.158 95 0.241
68 0.002
78 0.178 96 0.264
69 0.003
79 0.200 97 0.289
0.223 70 0.003 98 0.315
80
0.248 71 0.004 99 0.342
81
82 0.274 72 0.006 100 0.370
83 0.302 73 0.007 101 0.398
0.330 74 0.009 102 0.427
84
0.360 75 0.012 103 0.456
85
0.390 76 0.014 104 0.485
86
0.421 77 0.018 105 0.515
87
0.452 78 0.022
88
0.484 79 0.026
89
0.516 80 0.032
90

For m or n larger than 10, the probabilities are found from Table A as follows:

    zL = [Rx + 0.5 − m(N + 1)/2] / [mn(N + 1)/12]^{1/2}        zR = [Rx − 0.5 − m(N + 1)/2] / [mn(N + 1)/12]^{1/2}

    Desired                                   Approximated by
    Left-tail probability for Rx              Right-tail probability for −zL
    Right-tail probability for Rx             Right-tail probability for zR
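A small Python sketch of this calculation (our own illustration; the standard normal right-tail probability replaces a lookup in Table A):

    from math import erf, sqrt

    def normal_right_tail(z):
        # P(Z >= z) for a standard normal variable Z.
        return 0.5 * (1 - erf(z / sqrt(2)))

    def rank_sum_tail_probabilities(rx, m, n):
        # Approximate left- and right-tail probabilities for the rank sum Rx
        # when m or n exceeds 10, using the continuity-corrected deviates above.
        N = m + n
        mean = m * (N + 1) / 2
        sd = sqrt(m * n * (N + 1) / 12)
        z_left = (rx + 0.5 - mean) / sd
        z_right = (rx - 0.5 - mean) / sd
        return normal_right_tail(-z_left), normal_right_tail(z_right)

    # Example: m = n = 10 and Rx = 82 gives a left-tail probability of about
    # 0.044, close to the exact value 0.045 in Table F.
    print(rank_sum_tail_probabilities(82, 10, 10))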

Source: Adapted from Table B of C. H. Kraft and C. van Eeden (1969), A Non-parametric
Introduction to Statistics, Macmillan Publishing Co., New York, with permission.

Table G Kolmogorov-Smirnov Two-Sample Statistic


Each table entry labeled P is the cumulative right-tail probability of mnDmn under the null distribution
for small P, all 2 ≤ m ≤ n ≤ 12 or m + n ≤ 16, whichever comes first. For the one-sided statistics
mnD⁺mn and mnD⁻mn, the corresponding probability is P/2 if P is very small. Each table entry in the second
portion of the table, where 9 ≤ m = n ≤ 20, is the smallest value of mnDmn for which the cumulative
right-tail probability does not exceed the selected values 0.010, 0.020, 0.050, 0.100, and 0.200 at the top.
For mnD⁺mn and mnD⁻mn, each entry is approximately correct for the cumulative right-tail probability at
the bottom.

m   n   mnDmn   P        m   n   mnDmn   P        m   n   mnDmn   P

2 2 4 0.333 3 10 30 0007 4 II 44 0.001


2 3 6 0.200 27 0028 40 0007
2 4 8 0.133 24 0070 36 0022
2 5 10 0095 21 0.140 33 0.035
8 0286 3 II 33 0005 32 0063
2 6 12 0.071 30 0.022 29 0098
10 0214 27 0055 28 0144
2 7 14 0.056 24 0.110 4 12 48 0001
12 0167 3 12 36 0.004 44 0.005
2 8 16 0044 33 O.ot8 40 0.016
14 0.133 30 0.044 36 0.048
2 9 18 0.036 27 0088 32 0112
16 0.109 24 0.189 5 5 25 0008
2 10 20 0030 4 4 16 0.029 5 6 30 0.004
18 0091 12 0.229 25 0026
16 0.182 4 5 20 0016 24 0048
2 II 22 0026 16 0.079 20 0.108
20 0077 15 0143 5 7 35 0.003
18 0154 4 6 24 0010 30 0015
2 12 24 0022 20 0048 28 0030
22 0.066 18 0.095 25 0066
20 0132 16 0181 23 0116
3 9 0.100 4 7 28 0.006 5 8 40 0002
3 4 12 0057 24 0030 35 0009
9 0229 21 0067 32 0020
3 5 15 0.036 20 0.121 30 0.042
12 0143 4 8 32 0004 27 0.079
3 6 18 0024 28 0.020 25 0126
15 0.095 24 0.085 5 9 45 0001
12 0333 20 0.222 40 0006
3 7 21 0017 4 9 36 0.003 36 0014
18 0067 32 0014 35 0028
15 0167 28 0042 31 0.056
3 8 24 0.012 27 0.062 30 0086
21 0.048 24 0.115 27 0119
18 0.121 4 10 40 0.002 5 10 50 0.001
3 9 27 0009 36 0.010 45 0004
24 0.036 32 0.030 40 0.019
21 0.091 30 0.046 35 0061
18 0.236 28 0.084 30 0166
26 0.126

Table G (continued)
m   n   mnDmn   P        m   n   mnDmn   P        m   n   mnDmn   P

5 II 55 0.000 6 9 54 0.000 7 8 56 0000


50 0.003 48 0.003 49 0.002
45 0.010 45 0.006 48 0.005
44 0014 42 0014 42 0013
40 0.029 39 0.028 41 0.024
39 0044 36 0061 40 0.033
35 0074 33 0095 35 0.056
34 0.106 30 0176 34 0.087
6 6 36 0.002 6 10 60 0.000 33 0.118
30 0.026 54 0.002 7 9 63 0000
6 7 42 0001 50 0.004 56 0.001
36 0.008 48 0.009 54 0.003
35 0.015 44 0.019 49 0.008
30 0.038 42 0.031 47 O.oJ5
29 0.068 40 0.042 45 0021
28 0091 38 0066 42 0034
24 0.147 36 0.092 40 0.055
6 8 48 0001 34 0125 38 0.079
42 0.005 7 7 49 0001 36 0.098
40 0.009 42 0.008 35 0.127
36 0.023 35 0053 8 8 64 0.000
34 0.043 28 0212 56 0002
32 0.061 48 0.019
30 0093 40 0087
28 0.139 32 0283

Right-Tail Probability for mnDmn (Two-Sided Statistic)

m = n     0.200    0.100    0.050    0.020    0.010

9 45 54 54 63 63
10 50 60 70 70 80
II 66 66 77 88 88
12 72 72 84 96 96
13 78 91 91 104 117
14 84 98 112 112 126
15 90 105 120 135 135
16 112 112 128 144 160
17 119 136 136 153 170
18 126 144 162 180 180
19 133 152 171 190 190
20 140 160 180 200 220

          0.100    0.050    0.025    0.010    0.005

Approximate Right-Tail Probability for mnD⁺mn and mnD⁻mn (One-Sided Statistic)


For sample sizes outside the range of this table, the quantile points based on the asymptotic distribution
are approximated by calculating the following for the appropriate values of m and n:

Right-tail probability for Dmn:            0.200         0.100         0.050         0.020         0.010
                                           1.07√(N/mn)   1.22√(N/mn)   1.36√(N/mn)   1.52√(N/mn)   1.63√(N/mn)
Right-tail probability for D⁺mn or D⁻mn:   0.100         0.050         0.025         0.010         0.005
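The asymptotic quantile points above are easily computed; the following sketch (our own Python, with the coefficients hard-coded from the display above) illustrates the two-sided case.

    from math import sqrt

    # Right-tail probability for the two-sided Dmn -> coefficient c in c * sqrt(N/(m*n)).
    TWO_SIDED_COEFF = {0.200: 1.07, 0.100: 1.22, 0.050: 1.36, 0.020: 1.52, 0.010: 1.63}

    def ks_two_sample_quantile(m, n, alpha=0.050):
        # Approximate critical value of the two-sided statistic Dmn (not mnDmn)
        # for sample sizes outside the range of the table.
        N = m + n
        return TWO_SIDED_COEFF[alpha] * sqrt(N / (m * n))

    # Example: m = 30, n = 40, alpha = 0.05 gives about 1.36 * sqrt(70/1200) = 0.328;
    # reject identical distributions if the observed Dmn exceeds this value.
    print(ks_two_sample_quantile(30, 40))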

Source: Adapted from Table I of H. L. Harter and D. B. Owen, Eds. (1970), Selected Tables in Mathematical
Statistics, Vol. I, Markham Publishing Company, Chicago, with permission of the Institute of Mathematical
Statistics.
Bibliography

Alling, David W. : Early decision in the Wilcoxon two-sample test. J. Am. Stat. Assoc. 58,
713-720 (l963).
Anderson, T. W., Darling, D. A.: Asymptotic theory of certain" goodness of fit" test
criteria based on stochastic processes. Annals of Math. Stat. 23, 193-212 (1952).
Bahadur, R. R.: Stochastic comparison of tests. Annals ofMath. Stat. 31, 276-295 (1960).
Bahadur, R. R.: Some Limit Theorems in Statistics. (CBMS Monograph No.4),
Philadelphia: SIAM, 1971.
Barton, D. E., Mallows, C. L.: Some aspects of the random sequence. Annals of Math.
Stat. 36, 236-260 (1965).
Bauer, D F.: Constructing confidence sets using rank statistics. J Am. Stat. Assoc. 67,
687-690 (1972).
Birnbaum, A.: On the foundations of statistical inference. J. Am. Stat. Assoc. 57,
269-306 (1962).
Birnbaum, Z. W.: Numerical tabulation of the distribution of Kolmogorov's statistic
for finite sample size. J. Am. Stat. Assoc. 47, 425-441 (1952).
Birnbaum. Z. W.: On the use of the Mann-Whitney statistic. Proc. of the Third Berkeley
Symp. Math. Stat. and Probability, Vol. I, Berkeley: Untv. Calif. 1956, pp. 13-17.
Birnbaum, Z. W., Hall, R. A.: Small sample distributions for multi-sample statistics of
the Smirnov type. Annals of Math. Stat. 31, 710-720 (1960).
Birnbaum, Z. W., Klose, O. M.: Bounds for the variance of the Mann-Whitney statistic.
Annals of Math. Stat. 28, 933-945 (1957).
Birnbaum, Z. W., Tingey, F. H.: One-sided confidence contours for probability distri-
butions. Annals of Math. Stat. 22, 592-596 (1951).
Blackman, J.: An extension of the Kolmogorov distnbution. Annals of Math. Stat. 27,
513-520 (1956). Correction, ibid. 29, 318-324 (1958).
Blyth, C. R.: Note on relative efficiency of tests. Annals of Math. Stat. 29, 898-903 (1958).
Box, G. E. P., Andersen, S. L.: Permutation theory in the derivation of robust criteria
and the study of departures from assumption. J. Royal Stat. Soc. B, 17, 1-34 (1955).
Bradley, James V.: Distribution-Free Statistical Tests. Englewood Cliffs, New Jersey:
Prentice-Hall, 1968.
Brascamp, H. J., Lieb, E. H.: On extensions of the Brunn-Minkowski and Prekopa-
Leindler Theorems, including inequalities for log concave functions, and with an
application to the diffusion equation. J. Func. Anal. 22, 366-389 (1976).


Bross, I. D. J.: Comment on "Does an observed sequence of numbers follow a simple


rule? (Another look at Bode's law)." J. Am. Stat. Assoc. 66, 562-564 (1971).
Camp, B. H.: ApproximatIOn to the point binomial. Annals of Math. Stat. 22, 130-131
(1951).
Capon,1.: A note on the asymptotic normality of the Mann-Whitney-Wilcoxon statistic.
J. Am. Stat. Assoc. 56, 687-691 (l96Ia).
Capon, J. : Asymptotic efficiency of certain locally most powerful rank tests. Annals of
Math. Stat. 32, 88-100 (196Ib).
Capon, J.: On the asymptotic efficiency of the Kolmogorov-Smirnov test. J. Am. Stat.
Assoc. 60, 843-853 (1965).
Carvalho, P. E. de 0.: On the distribution of the Kolmogorov-Smirnov D-Statistic.
Annals of Math. Stat. 30, 173-176 (1959).
Chapman, Douglas G.: A comparative study of several one-sided goodness-of-fit tests.
Annals of Math. Stat. 29, 655-674 (1958).
Chernoff, H.: A property of some Type A regions. Annals of Math. Stat. 22, 472-474
(1951 ).
Chernoff, H.: A measure of asymptotic efficiency for tests of a hypothesis based on the
sum of observations. Annals of Math. Stat. 23, 493-507 (1952).
Chernoff, H.: Large sample theory: Parametric case. Annals of Math. Stat. 27, 1-22
(1956).
Chernoff, H., Savage, I. R.: Asymptotic normality and efficiency of certain nonpara-
metric test statistics. Annals of Math. Stat. 29,972-994 (1958).
Chung, 1. H., Fraser, D. A. S.: Randomization tests for a multivariate two-sample
problem. J. Am. Stat. Assoc. 53, 729-735 (1958).
Claypool, P. L., Holbert, D.: Accuracy of normal and Edgeworth approximations to
the distribution of the Wilcoxon signed rank statistic. J. Am. Stat. Assoc. 69, 255-258
(1974).
Clopper, C. J., Pearson, E. S.: The use of confidence or fiducial limits illustrated in the
case of the binomial. Biometrika, 26, 404-413 (1934).
Cochran, W. G.: The comparison of percentages in matched samples. Biometrika, 37,
256-266 (1950).
Conover, W. J.: Practical Nonparametric Statistics. New York: John Wiley, 1972.
Conover, W. J.: On some methods of handling ties in the Wilcoxon signed-rank test.
J. Am. Stat. Assoc. 68, 985-988 (1973).
Cox, D. R.: Some problems connected with statistical inference. Annals of Math. Stat.
29,357-372 (1958a).
Cox: D. R.: The regression analysis of binary sequences. J. Royal Stat. Soc. B, 20, 215-
242 (1958b).
Cox, D. R.: Two further applications of a model for binary regression. Biometrika,
45,562-565 (1958c).
Cox, D. R.: Analysis of Binary Data. N.Y.: Halsted, 1970.
Cramer, H.: On the composition of elementary errors. SkandinaVlsk Aktuarietidskrift,
11,13-74,141-180 (1928).
Cramer, H.: Mathematical Methods of Statistics. Princeton, New Jersey: Princeton
Umv., 1946.
Crow, E. L.: Confidence intervals for a proportion. Biometrika, 43, 423-435 (1956).
Daniels, H. E.: The statistical theory of the strengths of bundles of threads. Proc. Royal
Stat. Soc. A, 183,405-435 (1945).
Darling, D. A.: The Kolmogorov-Smirnov, Cramer-von Mises tests. Annals of Math.
Stat. 28, 823-838 (1957).
Dempster, A. P.: Personal communication reported by Wilks (1962, p. 339), 1955.
Dempster, A. P.: Generahzed D,; statistics. Annals of Math. Stat. 30, 593-597
(1959).

Depaix, M.: Distributions de déviations maximales bilatérales entre deux échantillons
indépendants de même loi continue. Comptes Rendus Acad. Sci. Paris, 255, 2900-
2902 (1962).
Deuchler, G.: Über die Methoden der Korrelationsrechnung in der Pädagogik und
Psychologie. Zeitschrift für Pädagogische Psychologie und Experimentelle Pädagogik,
15, 114-131, 145-159, 229-242 (1914).
Dixon, W. J.: Power under normality of several nonparametric tests. Annals of Math.
Stat. 25, 610-614 (1954).
Donsker, M. D.: Justification and extension of Doob's heuristic approach to the
Kolmogorov-Smirnov limit theorems. Annals of Math. Stat. 23, 277-281 (1952).
Doob, J. L.: Heuristic approach to the Kolmogorov-Smirnov limit theorems. Annals
of Math. Stat. 20, 393-403 (1949).
Doob,1. L.: Stochastic Processes. New York: John Wiley, 1953.
Drion, E. F.: Some distribution free tests for the difference between empirical cumulative
distribution functions. Annals of Math. Stat. 23, 563-574 (1952).
Dwass, M.: Modified randomization tests for nonparametric hypotheses. Annals of
Math. Stat. 28, 181-187 (1957).
Dwass, M.: The distribution of a generalized D: statistic. Annals of Math. Stat. 30,
1024-1028 (1959).
Edgeworth, F. Y.: On the probable errors of frequency-constants. J. Royal Stat. Soc.
71,381-397 (1908). Addendum, ibid., 72, 81-90 (1909).
Edgington, E. S.: Statistical Inference. The Distribution-Free Approach. New York:
McGraw-Hili, 1969.
Efron, B.: Does an observed sequence of numbers follow a simple rule? (Another look at
Bode's Law.) J. Am. Stat. Assoc. 66, 552-559 (1971); Comments and rejoinder, ibid.
559-568.
Ellison, B. E.: Two theorems for inferences about the normal distribution with appli-
cations in acceptance sampling. J. Am. Stat. Assoc. 59, 89-95 (1964).
Epstein, B. : Comparison of some nonparametric tests against normal alternatives with
an application to life testing. J. Am. Stat. Assoc. 50, 894-900 (1955).
Feller, W. E.: On the Kolomogorov-Smirnov limit theorems for empirical distributions.
Annals of Math. Stat. 19, 177-189 (1948).
Feller, W.: An IntroductIOn to Probability Theory and its Applications. New York: John
Wiley, 1969.
Fellingham, S. A., Stoker, D. J.: An approximation for the exact distribution of the
Wilcoxon test for symmetry. J. Am. Stat. Assoc. 59, 899-905 (1964).
Festinger, L.: The significance of difference between means without reference to the
frequency distribution. Psychometrika, 11, 97-105 (1946).
Fisher, R A : The logic of inductive inference (with discussion). J. Royal Stat. Soc. 98,
39-82 (1935).
Fisher, R. A.: Design of Experiments. Edinburgh: Oliver & Boyd, 1966 (first edition
1935).
Fisher, R. A.: Statistical Methodsfor Research Workers. Edinburgh: Oliver and Boyd,
1970.
Fisher, R. A., Yates, F.: Statistical Tables. New York: Hafner, 1963.
Fisz, M.: Probability Theory and Mathematical Statistics. New York: John Wiley,
1963.
Fix, E., Hodges, J. L., Jr.: Significance probabilities of the Wilcoxon test. Annals of
Math. Stat. 26, 301-312 (1955).
Fraser, D. A. S.: Most powerful rank-type tests. Annals of Math. Stat. 28, 1040-1043
(1957a).
Fraser, D. A. S.: Nonparametric Methods in Statistics. New York: John Wiley, 1957b.

Gastwirth, J. L.: The first-median test: A two-sIded version of the control median test.
J. Am. Stat. Assoc. 63, 692-706 (1968).
Gastwirth, J. L., Wolff, S.: An elementary method of obtallling lower bounds on the
asymptotic power of rank tests. Annals of Math. Stat. 39, 2128-2130 (1968).
Gibbons, J. D.: A proposed two-sample rank test: The Psi test and its properties. J.
Royal Stat. Soc. B, 26, 305-312 (1964a).
GIbbons, J. D.: Effect of nonnormality on the power function of the sIgn test. J. Am.
Stat. Assoc. 59, 142-148 (1964b).
Gibbons, J. D.: On the power of two-sample rank tests on the equality of two distribu-
tion functions. J. Royal Stat. Soc. B, 26, 293-304 (I 964c).
Gibbons, J. D.: Nonparametric Statistical Inference. New York: McGraw-Hill, 1971.
Gibbons, J. D., Pratt, J. W.: P-values: Interpretation and methodology. The Am.
Statistician, 29, 20-25 (1975).
Gnedenko, B. V.: Tests of homogeneity of probability distributions in two independent
samples. Math. Nachrichten, 12,26-66 (1954).
Gnedenko, B. V., Korolyuk, V. S.: On the maximum discrepancy between two empirical
distributions (in Russian). Doklady Akad. Nauk SSSR, 80,525-528 (1951).
Good, I. J.: Significance tests in parallel and in series. J. Am. Stat. Assoc. 53, 799-813
(1958).
Good, I. J.: A subjective evaluation of Bode's Law and an "objective" test for approxi-
mate numerical rationality. J. Am. Stat. Assoc. 64, 23-49 (1969).
Goodman, L. A.: Kolmogorov-Smirnov tests for psychological research. Psych. Bull.
51, 160-168 (1954).
Grizzle, J. E., Starmer, C. F., Koch, G. C.: Analysis of categorical data by linear models.
Biometrics, 25, 489-504 (1969).
Groeneboom, P., Oosterhoff, J.: Bahadur efficIency and probabilitIes of large deviations.
Statistica Neerlandlca, 31, 1-24 (1977).
Gurland, J: An inequality satisfied by the expectation of the reciprocal of a random
variable. The Am. Statistician, 21, (2), 24-25 (1967).
Guttman, 1. : Statistical Tolerance Regions: Classical and Bayesian. New York: Hafner
Press, 1970.
Halperin, M., Ware, J.: Early decision in a censored Wilcoxon two-sample test for
accumulating survival data. J. Am. Stat. Assoc. 69, 414-422 (1974).
Hajek, J.: Nonparametric Statistics. San Francisco: Holden-Day, 1969.
Hajek, J., Sidak, Z.: Theory of Rank Tests. New York: AcademIC, 1967.
Harter, H. L. : Expected values of normal order statistics. Biometrika, 48, 151-165 (1961).
Harter, H. L., Owen, D. B. (eds.): Selected Tables in Mathematical StatistICS, Vol. I,
Chicago: Markham Publ., 1970.
Hartigan, J. A.: USlllg subsample values as typical values. J. Am. Stat. Assoc. 64, 1303-
1317 (1969).
Harvard University Computation Laboratory: Tables of the Cumulative Binomial
Probability DistributIOn. Cambridge: Harvard Univ., 1955.
Hodges, J. L., Jr.: The siglllficance probability of the Smirnov two-sample test. Arkiv.
Mat., 3, 469-486 (1957).
Hodges, J. L., J r., Lehmann, E. L.: The efficiency of some nonparametflc competitors
of the t-test. Annals of Math. Stat. 27, 324--335 (1956).
Hoeffding, W.: A class of statistics with asymptotically normal distributIOn. Annals
of Math. Stat. 19,293-325 (1948).
Hoeffding, W.: On the distribution of the number of successes in independent trials.
Annals of Math. Stat. 27, 713-721 (1956).
Hoeffding, W.: Review of S. S. Wilks, Mathematical Statistics. Annals of Math. Stat.
33, 1467-1473 (1962).
Hogg, R. V.: Adaptive robust procedures: A partial review and some suggestIOns for
future applications and theory. J. Am. Stat. Assoc. 69, 909-927 (1974).

Hollander, M.: Rank tests for randomized blocks. Annals oj Math. Stat. 38, 867-877
(1967).
Huber, P. J.: Robust estimation in location. Annals oj Math. Stat. 35, 73-101 (1964).
Iman, R. E.: Use of a t-statistic as an approximation to the exact distribution of the
Wilcoxon signed ranks test statistic. Comm. in Stat. 3, 795-806 (1974).
Jacobson, J. E.: The Wilcoxon two-sample statistic: Tables and bibliography. J. Am.
Stat. Assoc. 58, 1086-1103 (1963).
Jeffreys, H.: Theory of Probability, 3rd ed. Oxford: Oxford Univ., 1961.
Johnson, N. L., Kotz, S.: Distributions in Statistics: Discrete Distributions. New York:
John Wiley, 1969.
Kac, M.: On deviations between theoretical and empirical distributions. Proc. Nat.
Academy of Sci. 35, 252-257 (1949).
Kadane, J. B.: For what use are tests of hypotheses and tests of significance. Introduc-
tion. Comm. in Stat. A,S, 735-736 (1976).
Karlin, S.: Decision theory for Polya type distributions. Case of two actions, I. Proc.
Third Berkeley Symp. on Math. Stat. and Probability, Vol. 1, Berkeley: Univ. Calif.,
1956, pp. 115-129.
Karlin, S.: P6lya-type distributions, II. Annals of Math. Stat. 28, 281-308 (l957a).
Karlin, S.: P6lya-type distributions, III: AdmIssibility for multi-action problems.
Annals oj Math. Stat. 28, 839-860 (1957b).
Karlin, S., Rubin, H.: Distributions possessing a monotone likelihood ratio. J. Am.
Stat. Assoc. 51, 637-643 (1956).
Kempthorne, 0.: Of what use are tests of significance and tests of hypotheses. Comm. in
Stat. A,S, 763-777 (1976).
Kimball, A. W.: Burnett, W. T., Jr., Doherty, D. G.: Chemical protection against
ionizing radiation. I. Sampling methods for screening compounds in radiation pro-
tectIon studies with mice. Radiation Research, 7, 1-12 (1957).
Klotz, J.: Small sample power and efficiency for the one sample Wilcoxon and normal
scores tests. Annals of Math. Stat. 34, 624-632 (1963).
Klotz, J.: Asymptotic efficiency of the two sample Kolmogorov-Smirnov test. J. Am.
Stat. Assoc. 62, 932-938 (1967).
Kolmogorov, A. N.: Sulla determinazione empirica di una legge di distribuzione.
Giorn. Inst. Ita!' Attuari, 4, 83-91 (1933).
Korolyuk, V. S.: Asymptotic expansions for the criterion of fit of A. N. Kolmogorov
and N. V. Smlrnov. Doklady Akad. Nauk SSSR, 93, 443-446 (1954). (Izvestiya Akad.
Nauk SSSR Ser. Mat. 19, 103-124 (1955).)
Korolyuk, V. S.: On the deviation of empirical distributions for the case of two inde-
pendent samples. Izvestiya Akad. Nauk SSSR Ser. Mat., 19, 81-96 (1955).
Kraft, C. H., Van Eeden, c.: A Nonparametric Introduction to StatIstics. New York:
Macmillan, 1968.
Kruskal, W. H.: Histoncal notes on the Wilcoxon unpaired two-sample test. J. Am.
Stat. Assoc. 52,356-360 (1957).
Kruskal, W. H.: "Tests of Significance" in International Encyclopedia oj the Social
Sciences, 14,238-249 (1968). New York: The Free Press.
Lancaster, H. 0.: Statistical control of counting experiments. Biometrika, 39, 419-422
(1952).
Lancaster, H. 0.: Significance tests in discrete distributions. J. Am. Stat. Assoc. 56,
223-234 (1961).
Lehman, S. Y.: Exact and approximate distribution for the Wilcoxon statistic with ties.
J. Am. Stat. Assoc. 56, 293-298 (1961).
Lehmann, E. L.: The power of rank tests. Annals of Math. Stat. 24, 23-43 (1953).
Lehmann, E. L.: Testing Statistical Hypotheses. New York: John Wiley, 1959.
Lehmann, E. L., Stein, C. : On the theory of some non-parametric hypotheses. Annals oj
Math. Stat. 20, 28-45 (1949).

Lieberman, G. J., Owen, D. B.: Tables of the Hypergeometric Probability Distribution.


Stanford: Stanford Univ., 1961.
Lindley, D. V.: A statistical paradox. Biometrika, 44, 187-192 (1957).
Lindley, D. V.: Bayesian Statistics, A Review. Philadelphia: SIAM, 1971.
Ling, R. F.: A survey of the accuracy of some approximations for t, χ², and F tail
probabilities. J. Am. Stat. Assoc. 73, 274-283 (1978).
Ling, R. F., Pratt, J. W.: The accuracy of a modified Peizer approximation to the hyper-
geometric distribution with comparison to some other approximations. Tech. Rep.
no. 348, Clemson Univ., Dept. Math. Sci., July 1980.
Loeve, M.: Probability Theory. New York: Van Nostrand, 1955.
Madansky, A.: More on length of confidence intervals. J. Am. Stat. Assoc. 57, 586-589
(1962).
Mann, H. B., Whitney, D. R.: On a test whether one of two random variables is sto-
chastically larger than the other. Annals of Math. Stat. 18,50-60 (1947).
Mantel, N., Rahe, A. J.: Differentiated sign tests. Internat. Stat. Rev., 48, 19-28 (1980).
Massey, F. J. : A note on the estimation of a distribution function by confidence limits.
Annals of Math. Stat. 21, 116-119 (1950a).
Massey, F. J.: A note on the power of a nonparametric test. Annals of Math. Stat. 21,
440-443 (1950b); Correction, ibid. 23, 637-638 (1952).
Massey, F. J.: The distribution of the maximum deviation between two sample cumu-
lative step functions. Annals of Math. Stat. 22,125-128 (195Ia).
Massey, F. J.: The Kolmogorov-Smirnov test for goodness of fit. J. Am. Stat. Assoc.
46, 68-78 (l95Ib).
Massey, F. J.: Distribution table for the deviation between two sample cumulatives.
Annals of Math. Stat. 23,435-441 (1952).
McCornack, R. L.: Extended tables of the Wilcoxon matched pair sIgned rank statistic.
J. Am. Stat. Assoc. 60, 864-871 (1965).
McNemar, Q.: Note on the sampling error of the difference between correlated pro-
portIOns or percentages. Psychometrika, 12,153-157 (1947).
Miller, L. H.: Table of percentage points of Kolmogorov statistics. J. Am. Stat. Assoc.
51,111-121 (1956).
Molenaar, W.: Approximations to the Poisson, Binomial and Hypergeometric Distri-
bution Functions. Amsterdam: MC Tract 31, Mathematical Centre, 1970.
Mood, A. M.: Introduction to the Theory of StatistIcs. New York: McGraw-HIli, 1950.
Mood, A. M. : On the asymptotic efficiency of certain nonparametric two-sample tests.
Annals of Math. Stat. 25, 514-522, (1954).
Mood, A., Graybill, F. A. : Introduction to the Theory of Statistics. New York: McGraw-
Hill, 1963.
Mood, A., Graybill, F. A., Boes, D. c.: I Illroduction to the Theory of Statistics. New
York: McGraw-Hili, 1974.
Moses, L. E.: Nonparametric statistics for psychological research. Psych. Bull. 49,
122-143 (1952).
Moses, L. E.: One sample limits of some two sample rank tests. J. Am. Stat. Assoc. 59,
645-651 (1964).
Mosteller, F.: Clinical studies of analgesic drugs. Biometrics, 8, 220-231 (1952).
Murphy, R. B.: Nonparametnc tolerance lImits. Annals of Math. Stat. 19, 581-589
(1948).
National Bureau of Standards: Tables of the Binomial Probability Distribution. Applied
Mathematics Series 6, Wash. D. C.: U.S. Govt. Printing Office, 1949.
National Bureau of Standards: Handbook of Mathematical Functions. Applied Mathe-
matics Series 55, Wash. D. C.: U.S. Govt. Printing Office, 1964.
Neyman, J.: First Course in Probability and Statistics. New York: Holt, 1950.
Neyman, J.: "Inductive behavior" as a basic concept of philosophy of sCIence. Rev.
Int. Stat. Inst. 25,7-22 (1957).

Neyman, J. : Tests of statistical hypotheses and their use in studies of natural phenomena.
Comm. in Stat. A, 5, 737-751 (1976).
Neyman, J., Pearson, E. S.: On the problem of the most efficient tests of statistical
hypotheses. Phil. Trans. of the Royal Stat. Soc. A, 231, 289-337 (1933).
Noether, G. E.: Elements of Nonparametric Statistics. New York: John Wiley, 1967.
Noether, G. E.: Some simple distribution-free confidence intervals for the center of a
symmetric distribution. J. Am. Stat. Assoc. 68, 716-719 (1973).
Ord, J. K.: Approximations to distribution functions which are hypergeometric series.
Biometrika, 55, 243-248 (1968).
Owen, D. B.: Handbook of Statisllcal Tables. Reading, Mass. : Addison-Wesley, 1962.
Paulson, E.: An approximate normalization of the analysis of variance distribution.
Annals of Math. Stat. 13,233-235 (1942).
Pearson, E. S.: Some thoughts on statistical inference. Annals ofMath. Stat. 33, 394-403
(1962).
Pearson, E. S., Hartley, H. O. (eds.): Biometrika Tables for Statisticians, Vol. I. Cam-
bridge, England: Univ. Press, 1966.
Peizer, D. B., Pratt, 1. W.: A normal approximation for binomial, F, beta and other
common, related distributions. J. Am. Stat. Assoc. 63,1416-1456 (1968).
Pitman, E. J. G.: Significance tests which may be applied to samples from any popu-
lations. J. Royal Stat. Soc. B, 4, 119-130 (1937a).
Pitman, E. J. G.: Significance tests which may be applied to samples from any popula-
tions, II. The correlation coefficient test. J. Royal Stat. Soc. B, 4, 225-232 (1937b).
Pitman, E. J. G.: Significance tests which may be applied to samples from any popula-
tions, III. The analysis of variance test. Biometrika, 29,322-335 (1938).
Pratt, 1. W.: Remarks on zeros and ties in the Wilcoxon signed ranks procedures. J.
Am. Stat. Assoc. 54,655-667 (1959).
Pratt, J. W.: On interchanging limits and integrals. Annals of Math. Stat. 31, 74-77
(1960).
Pratt, J. W.: Length of confidence intervals. J. Am. Stat. Assoc. 56, 549-567 (1961).
Pratt, J. W.: Robustness of some procedures for the two-sample location problem.
J. Am. Stat. Assoc. 59, 665-680 (1964).
Pratt, J. W.: Bayesian interpretation of standard inference situations, J. Royal Stat. Soc.
B,27, 169-203 (1965).
Pratt, 1. W.: A normal approximation for binomial, F, beta, and other common,
related tail probabilities, II. J. Am. Stat. Assoc. 63,1457-1483 (1968).
Pratt, J. W.: Comment on "Post-data two sample tests of location ", J. Am. Stat. Assoc.
68, 104-105 (1973).
Pratt, 1. W.: A discussIOn of the question: For what use are tests of hypotheses and tests
of significance. Comm. in Stat. A, 5, 779-787 (1976).
Pratt, 1. W.: Concavity of the log likelihood. J. Am. Stat. Assoc. 76,103-106 (1981).
Pratt,1. W., Raiffa, H., Schlaifer, R.: The foundations of decision under uncertainty:
An elementary exposition. J. Am. Stat. Assoc. 59, 353-375 (1964).
Putter, J. : The treatment of ties in some nonparametric tests. Annals of Math. Stat. 26,
368-386 (1955).
Pyke, R. : The supremum and infimum of the Poisson process. Annals of Math. Stat. 30,
568-576 (1959).
Quade, D.: On the asymptotic power of the one-sample Kolmogorov-Smirnov tests.
Annals of Math. Stat. 36,1000-1018 (1965).
Rahe, A. J.: Tables of critical values for the Pratt matched pair signed rank statistic.
J. Am. Stat. Assoc. 69,368-373 (1974).
Raiffa, H., Sch1aifer, R.: Applied Statistical Decision Theory. Boston: Div. Res.,
Harvard Business School, 1961.
Roberts, H. V.: For what use are tests of hypotheses and tests of significance. Comm.
in Stat. A, 5, 753-761 (1976).

Rosenbaum, S.: Tables for a nonparametric test of location. Annals of Math. Stat. 25,
146-150 (1954).
Rustagi. J. S.: Bounds for the variance of Mann-Whitney statistics. Annals of Math.
Stat. 13, 119-126 (1962).
Sandiford, P. J.: A new binomial approximation for use in sampling from finite popu-
lations. J. Am. Stat. Assoc. 55, 718-722 (1960).
Savage, I. R.: Bibliography of Nonparametric Statistics. Cambridge: Harvard Univ.
Press, 1962.
Savage, L. J. The Foundations of Statistics. New York: John Wiley, 1954.
Scheffe, H.: A useful convergence theorem for probability distributions. Annals of
Math. Stat. 18,434-438 (1947).
Scheffe, H., Tukey, J. W.: Nonparametric estimation, I. Validation of order statistics.
Annals of Math. Stat. 16, 187-192 (1945).
Siegel, S.: Non-parametric Statistics for the Behavioral Sciences. New York: McGraw-
Hill, 1956.
Singer, B.: Distribution-Free Methods for Nonparametric Methods: A Classified and
Selected Bibliography. Leicester: British Psych. Soc. 1979.
Smirnov, N. V.: Estimate of deviation between empirical distribution functions in two
independent samples (in Russian). Bull. Moscow Univ., 2, 3-16 (1939).
Smirnov. N. V. : Approximation of distribution laws of random variables from empirical
data (in Russian). Uspehi Mat. Nauk, 10, 179-206 (1944).
Steck, G. P.: The Smirnov two sample tests as rank tests. Annals of Math. Stat. 40, 1449-
1466 (1969).
Stein, C.: Efficient nonparametric testing and estimation. Proc. Third Berkeley Symp.
on Math. Stat. and Probability, Vol. I. Berkeley: Univ. Calif., 1956, pp. 187-195.
Sterne, T. E.: Some remarks on confidence or fiducial limits. Biometrika, 41, 275-278
(1954).
Stuart, A. : The comparison offrequencies in matched samples. British J. Stat. Psych. 10,
29-32 (1957).
Tate, M. W., Clelland, R. C.: Nonparametric and Shortcut Statistics, Danville, Ill.:
The Interstate Publishers & Printers, 1957.
Teichroew, D.: Tables of expected values of order statistics and products of order
statistics for samples of size twenty and less from the normal distribution. Annals of
Math. Stat. 27, 410-426 (1956).
Terry, M. E.: Some rank order tests which are most powerful against specific parametric
alternatives. Annals of Math. Stat. 23, 346-366 (1952).
Tsao, C. K. : An application of Massey's distribution of the maximum deviation between
two sample cumulative step functions. Annals of Math. Stat. 25, 587-592 (1954).
Tukey, J. W. : Nonparametric estimation, II. Statistically equivalent blocks and tolerance
regions-the continuous case. Annals of Math. Stat. 18,529-539 (1947).
Uhlmann, W.: Vergleich der hypergeometrischen mit der Binomial-Verteilung.
Metrlka, 10, 145-158 (1966).
Uzawa, H.: Locally most powerful rank tests for two-sample problems. Annals of
Math. Stat. 31, 685-702 (1960).
van der Vaart, H. R.: Some extensions of the idea of bias. Annals of Math. Stat. 32,
436-447 (1961).
van der Waerden, B. L.: Order tests for the two-sample problem and their power, I, II,
III. Proc. Koninklijke Nederlandse Akademie van Wetenschappen (A), 55 (Inda-
gationes Mathematicae, 14), 453-458 (1952); Indagationes Mathematicae 15, 303-310,
311-316 (1953); correction, Indagationes Mathematicae 15, 80 (1953).
van der Waerden, B. L. : Testing a distribution function. Proc. Koninklijke Nederlandse
Akademle van Wetenschappen (A), 56 (lndagationes Mathematicae 15), 201-207
(1953).

van der Waerden, B. L.: The computation of the X-distribution. Proc. Third Berkeley
Symp. Math. Stat. and Probability, Vol. I. Berkeley: Univ. Calif., 1956, pp. 207-208.
van der Waerden, B. L., Nievergelt, E.: Tafeln Zum Vergleich Zweier Stichproben
mittels X-test und Zeichentest. Berlin-Gottingen-Heidelberg: Springer-Verlag, 1956.
van Eeden, c.: The relation between Pitman's asymptotic relative efficiency of two tests
and the correlation coefficient between their test statistics. Annals of Math. Stat.
34, 1442-1451 (1963).
von Mises, R.: Wahrscheinlichkeitsrechnung und ihre Anwendung in der Statistik und
theoretischen Physik. Leipzig-Wien: F. Deuticke, 1931.
Walsh, J. E.: ApplIcations of some significance tests for the median which are valid under
very general conditions. J. Am. Stat. Assoc. 44, 342-355 (1949a).
Walsh, J. E.: Some significance tests for the median which are valid under very general
conditions. Annals of Math. Stat. 20, 64-81 (l949b).
Walsh, J. E.: Nonparametric tests for median by interpolation from sign tests. Annals
of the Inst. of Stat. Math.H, 183-188 (1959-60).
Walsh, J. E.: Handbook of Nonparametric Statistics, I. Investigation of Randomness,
Moments, Percentiles and Distributions. New York: Van Nostrand, 1962a.
Walsh, J. E.: Some two-sided distribution-free tolerance intervals of a general nature.
J. Am. Stat. Assoc. 57, 775-784 (1962b).
Walsh, 1. E.: Handbook of Nonparametric Statistics, II: Results for Two and Several
Sample Problems, Symmetry and Extremes. New York: Van Nostrand, 1965.
Walsh, 1. E.: Handbook of Nonparametric Statistics, III: Analysis of Variance. New
York: Van Nostrand, 1968.
Wilcoxon, F. : Individual comparisons by ranking methods. Biometrics, 1, 80-83 (1945).
Wilks, S. S.: Mathematical Statistics. New York: John Wiley, 1962.
Wilson, E. B., Hilferty, M. M.: The distribution of chi-square. Proc. Nat. Academy of
Sci. 17, 684-688 (1931).
Wise, M. E.: A quickly convergent expansion for cumulative hypergeometric proba-
bilities, direct and inverse. Biometrika, 41, 317-329 (1954).
Young, W. H.: On semiintegrals and oscillating successions offunctions. Proc. London
Math. Soc. (2),9,286-324 (1911).
Zahl, S.: Bounds for the Central Limit Theorem error. Ph.D. Dissertation, Boston,
Mass.: Harvard Univ. 1962.
Index

Adaptive test 216    randomized distribution 382-383


Admissibility, of one-tailed test 53-54 random effects models 395 - 396
of two-tailed test 63 -65 signed rank procedures 386 - 392
Alling, D W. 290 sign test 354,356,357,359,365,380
AlternatIve dIstributIon 14 student's t test 351 - 352, 364, 365.
Alternative hypothesis 14,231 380
one-sided 223, 311 sum of scores test 400-401, 409
two-sided 223 - 224, 312 - 313 sum of signed constants test 380,
shift assumption 231, 232-234, 249, 381-382,383,384.387.409
297 tests 357-358.371-373
Anderson, T. W 335, 336, 344 two-sample problem 359-360.
Asymptotic dIstribution 72, 73 398-401
Asymptotic efficiency 333, 353, 386 Wilcoxon rank sum test.
AsymptotIc power 347-352 two-sample 360. 400
Asymptotic relative effiCIency 232, Wilcoxon signed rank test,
345-412 one-sample 359. 360. 362 - 363.
confidence intervals 346, 376 379-380,380-381,383,384.386.
confidence limits 362-364, 374-375 389- 391, 400
defInition 346 Average rank procedure (see Ties)
estimates 355 - 362, 373 - 374
Kolmogorov-Smirnov test 401-412
matched paIrS expenments 394- 398 Bahadur, R. R. 333.370
medIan procedures 365-370,379, Barton. D E. 323
380,383,384,386,389,390,401 Bauer. D. F. 175
median test 360,397,402-403,404, Bernoulli trials 2, 6. 18
405,406,407,412 Beta distribution 95
Mood's definition 377 Bias 7-8
normal scores test 380,382,383.384. BInomial approximations 240-241
386,389-390.391-392 BinomIal distribution 4
normal theory tests 351. 354, 356. confidence regIOns 41-44
357,364,365-370,379,380.383. cumulative distribution functIOn 6
384,386,389,390, 391, 400 expected value and variance 7
one-sample problem 359 normal approximatIOn to 66-69

455
456 Index

  one-tailed tests 22-23
  Poisson approximation to 66-69
  probability of type II error 17-19
  randomized tests 37-38
  significance levels 26-27
  tables 428-430, 431-432
  two-tailed tests 28-29, 31
  true confidence levels 48-49
Binomial test 118
  confidence intervals 93, 41-44, 46
  one-tailed test 22-23, 52-58
  sign test 88
  two-tailed test 28-29, 58-65
Birnbaum, A. 25, 99
Birnbaum, Z. W. 263, 265, 334, 335, 343, 422
Blackman, J. 322
Blyth, C. R. 370
Bradley, J. V. 83
Brascamp, H. J. 423
Bross, I. D. J. 21

Camp, B. H. 44
Capon, J. 255, 333, 404
Carvalho, P. E. de O. 322-323, 326
Central limit theorem (see also Liapounov central limit theorem) 73
Chapman, D. G. 343
Chernoff, H. 25, 255, 392
Chi-square distribution 44
Chi-square goodness of fit statistic 318-319
Chi-square statistic 239
Chi-square test for equality of two proportions 238, 240
Chi-square test of independence 109, 110, 115
Claypool, P. L. 151
Clopper, C. J. 43, 61
Clopper-Pearson charts 43, 44
Cochran, W. G. 107
Cochran's Q test 140, 110
Combining individual tests 28, 29
Complete family of distributions 13-14
Completeness 53-54, 63-65
Conditional level 100
Conditional power 100
Conditional sign test (see also sign test with zero differences, one-sample) 98-104
Conditional tests (see randomization tests)
Conditional unbiasedness 100
Confidence bounds (see confidence limits)
Confidence intervals 263-265, 42, 231
  asymptotic relative efficiency 346, 376
  binomial test, one-tailed 54-55
  binomial test, two-tailed 61-63
  center of symmetry 157-158
  quantiles, one-sample 92-97
  quantiles, two-sample 242-243
  signed rank tests 174-175
  Wilcoxon rank sum test, two-sample 253-255
  Wilcoxon signed rank test, one-sample 157-158
  Wilcoxon signed rank test with ties, one-sample 161, 169
Confidence levels 119, 45
  and tolerance regions 119-130
Confidence limits 42-43
  and asymptotic relative efficiency 362-364, 374-375
Confidence region 41-52
Conover, W. J. 170
Conservative level (see also level) 19
Conservative method of breaking ties (see ties)
Consistency 8
  two-sample median tests 245-246
  two-sample Wilcoxon rank sum 257
  one-sample Wilcoxon signed rank 153-155
  Kolmogorov-Smirnov, two-sample tests 331-334
Continuity correction (see also normal approximation) 68
Continuous random variables 6
Convergence of frequency functions 69-72
Convergence in distribution 72-73
Convergence in probability 153-155
Coverage of a region 118
Cox, D. R. 17, 107, 116, 117
Cox model 107, 116-118
Cramer, H. 73, 344
Cramer-von Mises statistics 318, 335, 344
Critical function 15
  of one-sample observation randomization test 217
  of two-sample observation randomization test 307
Critical region 15

Critical values 29
  and two-tailed tests 29
Crow, E. L. 62
Cumulative distribution function 4, 5

Daniels, H. E. 343
Darling, D. A. 323, 335, 344
Dempster, A. P. 335, 336, 343
Density function (see probability density function)
Depaix, M. 323
Deuchler, G. 249
Discrete probability distribution 4
  and P-value 26
Distribution (see beta, binomial, chi-square, double exponential, exponential, discrete, hypergeometric, normal)
Distribution-free procedures 82
Dixon, W. J. 333
Donsker, M. D. 330
Doob, J. L. 73, 330
Double exponential distribution 368
Drion, E. F. 322, 326
Dwass, M. 211, 343

Edgeworth expansion 151
Edgeworth, F. Y. 230
Efficacy 349, 350, 351
Efron, B. 26
Ellison, B. E. 90, 264
Empirical distribution function 319-320
Epstein, B. 333
Equal tails criterion (see also two-tailed tests) 58
Equivalent tests 22
Errors of the first and second kind (see type I and type II errors)
Estimate 6
Estimator 6
  and asymptotic relative efficiency 355-362, 373-374
Exact confidence level 45
Exact level (see also level) 19
Expected value 7
Exponential distribution 104, 232, 366, 368-369

Factorization criterion (see also sufficiency) 10
Feller, W. 69, 330
Fellingham, S. A. 151
Festinger, L. 249
Fisher information 351, 385, 386
Fisher, R. A. 148, 184, 204, 214, 226, 230, 239, 267, 427
Fisher's exact test 234, 238-241, 246, 109, 110
Fisher-Yates-Terry test 267-268, 273, 277, 333
Fisz, M. 73
Fix, E. 252, 286
Fixed effects model 395-396
Fraser, D. A. S. 130, 221
Frequency function 4

Gastwirth, J. L. 280, 281, 282, 392, 419
Gibbons, J. D. 28, 32, 255, 267, 277
Gnedenko, B. V. 322
Good, I. J. 26
Goodness of fit 318, 319
Graybill, F. A. 243
Groeneboom, P. 370
Gurland, J. 419
Guttman, I. 130

Hajek, J. 323, 409
Halperin, M. 280
Harter, H. L. 149, 253, 267, 322, 434, 444
Hartigan, J. A. 209
Hartley, H. O. 44
Harvard University 6, 41
Hilferty, M. M. 44
Hodges, J. L., Jr. 252, 286, 322, 323, 324, 337, 339, 370
Hoeffding, W. 88, 132, 255
Hogg, R. V. 216, 393
Holbert, D. 151
Hollander, M. 420
Huber, P. J. 230
Hypergeometric distribution 238-239
  tables 435-436
Hypothesis (see alternative hypothesis, null hypothesis)
Hypothesis test (see significance test)

Iman, R. E. 151

Jeffreys, H. 26
Jennrich, R. I. 322
Johnson, N. L. 241, 284

Kac, M. 330
Kadane, J. B. 17
Karlin, S. 52, 64, 65
Kempthorne, O. 17
Kim, P. J. 322
Kimball, A. W. 280
Klose, O. M. 265, 422
Klotz, J. 184, 333
Kolmogorov, A. N. 330, 334
Kolmogorov-Smirnov statistic, one-sample 318, 319, 331, 334-336
Kolmogorov-Smirnov statistic, two-sample 318, 319, 320-334, 380
  asymptotic relative efficiency 401-412
  consistency and power 331-334
  location alternatives 321
  null distribution of 322-324, 325-330
  tables 443-444
  one-tailed and two-tailed tests 325
  ties 330-331
  two-sample rank tests 322
Korolyuk, V. S. 322, 324, 339
Kotz, S. 241, 284
Kraft, C. H. 442
Kruskal, W. H. 17, 249

Lancaster, H. O. 27
Laplace distribution (see exponential distribution)
Lehmann alternatives 333
Lehmann, E. L. 52, 54, 57, 64, 104, 218, 221, 333, 370
Level of a test (see also confidence level, significance level) 19-20
Liapounov central limit theorem (see also central limit theorem) 74
Likelihood ratio 56-57
Lieb, E. H. 423
Lieberman, G. J. 239, 241, 436
Limiting efficiency per observation (see asymptotic efficiency)
Lindley, D. V. 20, 26
Ling, R. F. 241, 428
Locally most powerful tests
  rank tests, two-sample 272-279
  signed rank tests, one-sample 181-185, 392-394
Location, tests for 145-185, 205-216, 231-279, 297
Loeve, M. 73
Logistic distribution 273

Madansky, A. 50
Mallows, C. L. 323
Manis, M. 206
Mann, H. B. 249, 253
Mann-Whitney test (see Wilcoxon rank sum test, two-sample)
Mass function (see frequency function)
Massey, F. J. 322, 332, 334, 335
Matched pair experiment 105, 106, 209, 221-222
  and asymptotic relative efficiency 394-398
McCornack, R. L. 151
McNemar, Q. 107
McNemar test (see test for equality of proportions)
Mean squared error 8, 12
Median procedures (see also sign test)
  and asymptotic relative efficiency 365-370, 379, 380, 383, 384, 386, 389, 390, 401
Median test 231, 234, 236, 237, 249, 267, 333
  asymptotic relative efficiency 397, 400, 402-407, 412
  confidence intervals 242-243
  consistency 245-246
  control median test 237, 243
  first median test 237, 243
  optimal properties 246, 247
  power 243-244
  ties 241-242
Median unbiased estimator 125
Mid P-value (see also P-value) 27
Miller, L. H. 325, 335
Minimum variance unbiased estimator 7, 12-14
Modified ranks 158
Modified Wilcoxon signed rank test 140-158, 172, 176
Molenaar 241
Mood, A. M. 243, 377
Mood's definition (see also asymptotic relative efficiency) 377
Moses, L. E. 212, 287, 303
Mosteller, F. 107, 120, 235, 237
Murphy, R. B. 124

National Bureau of Standards 6, 41
Next P-value (see also P-values) 27

Neyman, J. 17
Neyman-Pearson fundamental lemma 56-58, 64, 104, 222, 272, 311
Nievergelt, E. 184, 267
Noether, G. E. 176
Nominal level 19
  and power 21
Nominal confidence level 45
Nonparametric procedure 82-83, 318
Normal approximation
  binomial 67-69, 333
  hypergeometric distribution 240, 241
  randomization distribution, one-sample 212-213
  randomization distribution, two-sample 303
  Wilcoxon signed rank test, one-sample 150-151
Normal distribution 5, 73, 232
  tables 426-427
Normal scores 267
Normal scores test 267-268
  and asymptotic relative efficiency 380, 382-384, 386, 389-390, 391-392
Normal theory procedures (see also Student's t-test)
  and asymptotic relative efficiency 351, 354, 356, 357, 364, 365-370, 379, 380, 383, 384, 386, 389, 390, 391, 400
Null distribution 14
  Kolmogorov-Smirnov test 322-324, 325-330
  rank sum statistics, two-sample 252-253
Null hypothesis 14, 15-16
Null probability 17

Observations 3
Observation-randomization tests, one-sample 204, 216-225
  approximations and modifications 210-216
  confidence intervals for mean 209-210
  on mean 205-216
  rank randomization tests, one-sample 224-226
Observation-randomization tests, two-sample 232, 305-313
  confidence intervals 300-301
  difference of sample means 297-305
  modifications 302-303
One-sample problem 82-104, 145-185, 203-226, 318, 334-336
  and asymptotic relative efficiency for shift families 359, 378-394
One-tailed tests 22-23, 52-58
Oosterhoff, J. 370
Ord, J. K. 240
Order statistics 93, 94-96
Owen, D. B. 149, 239, 241, 253, 322, 434, 436, 444

P-value, one-tailed 23-28
  two-tailed tests 29-32
Paired sample problem 104-118, 145-146, 147, 203
Parameter 4
Paulson, E. 44
Pearson, E. S. 17, 21, 43, 44, 61
Pearson's chi-square goodness of fit statistics (see chi-square goodness of fit statistic)
Peizer, D. B. 241, 427, 428
Permutation test (see randomization test)
Permutation invariant procedures 175, 177-181
  observation randomization tests, one-sample 217, 219-220
  observation randomization tests, two-sample 306, 310
  rank tests, two-sample 269-272
  signed rank tests, one-sample 177-179
Pitman, E. J. G. 204
Pitman efficiency (see asymptotic efficiency)
Pitman tests (see randomization tests)
Pitman's formula 353, 377-378, 380
Poisson distribution 66-67
Power (see also asymptotic power) 21, 22
  binomial tests 23
  Kolmogorov-Smirnov tests, two-sample 331-334
  median test 242-243
  observation randomization test 216-217
  rank tests, two-sample 272-279
  sign tests, two-sample 243-244
  type I and type II errors 21
  Wilcoxon rank sum test, two-sample 255-257
  Wilcoxon signed rank test, one-sample 151-153
Power efficiency 346
Power Laplace distribution (see double exponential distribution)
Pratt, J. W. 17, 20, 26, 28, 32, 44, 50, 51, 71, 162, 241, 247, 257, 401, 421, 423, 427, 428, 432
Principle of invariance 178, 180-181, 218, 223-224, 308, 312-313
Principle of minimum likelihood 30, 60
Probability density function 4
Probability integral transformation 95
Pyke, R. 343

Quade, D. 402, 404, 411
Quantile 83-85
Quantile test, two-sample (see also median test) 236-248, 267

Rahe, A. J. 165
Random effects model and asymptotic relative efficiency 395-396
Random method of breaking ties (see ties)
Random variable 3
Randomization distribution, one-sample 203, 204
  approximations to 212-216
  expected value and variance 210
  n!2^n type distribution 219
  2^n type distribution 219
  t-statistic 207-208
Randomization distribution, two-sample 296, 298-299
  approximations to 303-305
  expected value and variance 301-302
  N! type distribution 306
  (N choose m) type distribution 306
  Student's t-statistic 303-305
Randomization test, one-sample (see observation-randomization test, one-sample and rank-randomization test, one-sample) 203-204, 382, 383
Randomization test, two-sample (see observation-randomization test, two-sample and rank-randomization test, two-sample) 296
Randomized confidence regions 51-52
Randomized P-value 40-41
Randomized test procedures 10, 34-41
Rank-randomization test, one-sample (see also signed rank test) 204
Rank-randomization test, two-sample (see also rank test) 232, 296
Rank sum test 77
Rank tests, two-sample 231-279, 322
Reduced sample procedure (see ties)
Regression model 117
Rejection rule 14-15
Relative efficiency 345, 346
Roberts, H. V. 17
Rosenbaum, S. 237, 243, 283
Rosenbaum's test 243
Rubin, H. 52, 65
Rustagi, J. S. 265

Sandiford, P. J. 240
Savage, I. R. 255, 392
Savage, L. J. 17, 20
Scheffé, H. 71, 130
Shift assumption 231, 232-234, 249, 297
Shift families 379
Sidak, Z. 323, 409
Simple random sample 105
Sign test for quantiles, one-sample 85-97, 146
  confidence intervals 87, 92-96
  optimum properties 88-92
  random effects models 92-96
  zero differences 97-104, 107, 114
  asymptotic relative efficiency 354, 356, 357, 359, 365, 380
Sign test, two-sample 234-248, 348, 349, 354
  asymptotic relative efficiency 380
  (see also Fisher's exact test, median test)
Sign test, two-sample with fixed ξ 236, 237, 241
Signed-rank of observations 147-148
Signed-rank sum 148
Signed-rank tests, one-sample 173-177, 180-181, 203, 216
  asymptotic relative efficiency 386-392
  confidence intervals 174-175
  Walsh averages 173-174
Signed-rank zero procedure (see ties)
Significance tests 14-34
Significance level (see also level) 19-20
  interpretation 20
  power 21
Size (see level or significance level)

Smirnov, N. V. 330, 335
Standard deviation 7
Standard normal density 5
Standard random variable 67
Statistic 3
Statistical significance 15
Steck, G. P. 323, 330, 333, 338, 339
Stein, C. 216, 218
Stochastic dominance 231
Stoker, D. J. 151
Stuart, A. 110
Student's t statistic 206, 207-208, 212-215, 303-305
  and asymptotic relative efficiency 351, 352, 364, 365, 380
Sufficiency 8-12, 54, 177-178
  and order statistics 270
Sum of scores tests 265-268, 273, 276-279
  and asymptotic relative efficiency 400-401, 409
Sum of signed constants test 172-173, 216
  and asymptotic relative efficiency 380, 381-382, 383, 384, 387, 409
Symmetry 146-147

T-test (see Student's t-test)
Tables 426-444
Teichroew, D. 267, 293
Terry, M. E. 267
Test statistic 15
  and asymptotic relative efficiency 357-358, 371-373
Test for equality of proportions 106-116, 236
  (see also sign test, one-sample with zero differences)
Test for 2 x 2 tables (see chi-square test for equality of two proportions)
Ties
  in Kolmogorov-Smirnov tests 330-331
  non-zero 167-168
  sign test, two-sample 241-242
  sign test with zero differences, one-sample 97-104
  Wilcoxon rank sum test, two-sample 258-263
  Wilcoxon signed rank test, one-sample 160-171
    average rank procedure 162, 163-165, 167-168, 170
    conservative method of tie breaking 166, 170-171
    random method of tie breaking 166, 171
    reduced sample procedure 162, 167, 170-171
    signed-rank zero procedure 163-165, 170-171
Tingey, F. H. 334, 343
Tolerance proportion 119
Tolerance regions 118-130
  bivariate 127
  Wilk's method 121-124
Trinomial distribution 104
True confidence level (see also confidence level) 45
  and binomial distribution 48-49
Tsao, C. K. 322
Tukey, J. W. 130, 209
Two by two tables 109, 110, 234-236
Two-sample problem 231-279, 296-313, 318, 321-334
  and asymptotic relative efficiency for shift families 398-401, 359-360
Two-tailed tests 17, 28-34, 58-65
Type I and type II errors 17-22

Uhlmann, W. 284
Unbiasedness of statistic 7
  confidence level 61-62
  tests 59, 101-104
United States Senate 250
Uniformly consistent estimator 154
Uniformly most powerful tests 52-53, 181, 222-224, 310-313
Uniformly most powerful unbiased test 59
Unit normal density (see standard normal density)
Uzawa, H. 278

Valid test 20
van der Vaart, H. R. 125
van der Waerden, B. L. 184, 267, 330, 333
van der Waerden statistic (see also normal scores test) 267-268, 333
van Eeden, C. 184, 442
Variance 7
von Mises, R. 344

Wallace, D. L. 235, 237
Walsh averages
  confidence intervals for center of symmetry 157-158
  sign tests 173-177
  ties 161
Walsh, J. E. 130, 176
Whitney, D. R. 249, 253
Wilcoxon, F. 249
Wilcoxon rank sum test, two-sample 266, 267, 249-265, 333
  asymptotic relative efficiency 360, 400
  confidence intervals 253-255
  consistency 257
  expectation and variance 252-253
  power 255-257
  relation to Mann-Whitney statistic 250-251
  tables 437-442
  ties 258-263
Wilcoxon signed rank test, one-sample 145, 147-171, 172, 173, 174, 175, 176, 209, 216, 217, 225
  asymptotic null distribution theory 150-151
  asymptotic relative efficiency 359, 360, 362-363, 379-381, 383, 384, 386, 389-391, 400
  confidence intervals 157-158
  consistency 153-155
  expected value and variance 149, 151-153
  power 151-153
  tables 433-434
  Walsh averages 150
Wilk's method (see tolerance regions)
Wilson, E. B. 44
Wise, M. E. 240
Wolff, S. 392, 419

Yates, F. 184, 239, 267, 427
Young, W. H. 71

Zahl, S. 212
Zero differences (see ties and conditional sign test)
Springer Series in Statistics

Measures of Association for Cross Classifications
Leo A. Goodman and William H. Kruskal
1979 / 146 pp. / cloth
ISBN 0-387-90443-3

Statistical Decision Theory: Foundations, Concepts, and Methods
James Berger
1980 / 425 pp. / 20 illus. / cloth
ISBN 0-387-90471-9

Simultaneous Statistical Inference, Second Edition
Rupert G. Miller, Jr.
1981 / 299 pp. / 25 illus. / cloth
ISBN 0-387-90548-0

Point Processes and Queues: Martingale Dynamics
Pierre Bremaud
1981 / approx. 384 pp. / 31 illus. / cloth
ISBN 0-387-90536-7

Non-Negative Matrices and Markov Chains
E. Seneta
1981 / 304 pp. / cloth
ISBN 0-387-90598-7

Statistical Computing with APL
Francis John Anscombe
1981 / 426 pp. / 70 illus. / cloth
ISBN 0-387-90549-9

Concepts of Nonparametric Theory
John Pratt and Jean D. Gibbons
1981 / 469 pp. / 23 illus. / cloth
ISBN 0-387-90582-0