A Parametric Approach to Nonparametric Statistics
Mayer Alvo and Philip L. H. Yu
Springer Series in the Data Sciences
Series Editors:
Jianqing Fan, Princeton University, Princeton
Michael Jordan, University of California, Berkeley
Ravi Kannan, Microsoft Research Labs, Bangalore
Yurii Nesterov, Universite Catholique de Louvain, Louvain-la-Neuve
Christopher Ré, Stanford University, Stanford
Larry Wasserman, Carnegie Mellon University, Pittsburgh
Springer Series in the Data Sciences focuses primarily on monographs and graduate level textbooks. The target
audience includes students and researchers working in and across the fields of mathematics, theoretical computer
science, and statistics.
Data Analysis and Interpretation is a broad field encompassing some of the fastest-growing subjects in interdisciplinary statistics, mathematics and computer science. It encompasses a process of inspecting, cleaning,
transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and
supporting decision making. Data analysis has multiple facets and approaches, including diverse techniques un-
der a variety of names, in different business, science, and social science domains. Springer Series in the Data
Sciences addresses the needs of a broad spectrum of scientists and students who are utilizing quantitative methods
in their daily research.
The series is broad but structured, including topics within all core areas of the data sciences. The breadth of
the series reflects the variation of scholarly projects currently underway in the field of machine learning.
Mayer Alvo
Department of Mathematics and Statistics
University of Ottawa
Ottawa, ON, Canada

Philip L. H. Yu
Department of Statistics and Actuarial Science
University of Hong Kong
Hong Kong, China
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
In randomized block designs, the Friedman statistic provides a nonparametric test of the
null hypothesis of no treatment effect. This book was motivated by the observation that
when the problem is embedded into a smooth alternative model to the uniform distribu-
tion over a set of rankings, this statistic emerges as a score statistic. The realization that
this nonparametric problem could be viewed within the context of a parametric problem
was particularly revealing and led to various consequences. Suddenly, it seemed that
one could exploit the tools of parametric statistics to deal with several nonparamet-
ric problems. Penalized likelihood methods were used in this context to focus on the
important parameters. Bootstrap methods were used to obtain approximations to the
distributions of estimators and to construct confidence intervals. Bayesian methods were
introduced to widen the scope of applicability of distance-based models. As well, the
more commonly used test statistics in nonparametric statistics were reexamined. The
occurrence of ties in the sign test could be dealt with in a natural formal manner as
opposed to the traditional ad hoc approach. This book is a first attempt at bridging the
gap between parametric and nonparametric statistics and we expect that in the future
more applications of this approach will be forthcoming.
The authors are grateful to Mr. Hang Xu for contributions that were incorporated
in Chapter 10. We are grateful to our families for their support throughout the writing
of this book. In particular, we thank our wives Helen and Bonnie for their patience and
understanding. We are also grateful for the financial support of the Natural Sciences
and Engineering Research Council of Canada (NSERC) and the Research Grants Coun-
cil of the Hong Kong Special Administrative Region, China (Project No. 17303515),
throughout the preparation of this book.
Ottawa, ON, Canada
Mayer Alvo

Hong Kong, China
Philip L. H. Yu
Contents
9. Efficiency
9.1. Pitman Efficiency
9.2. Making Use of Le Cam's Lemmas
9.2.1. Asymptotic Distributions Under the Alternative in the General Multi-Sample Problem: The Spearman Case
9.2.2. Asymptotic Distributions Under the Alternative in the General Multi-Sample Problem: The Hamming Case
9.3. Asymptotic Efficiency in the Unordered Multi-Sample Test
9.4. Exercises
Index
Notation
Part I.
1. Introduction
This book grew out of a desire to bridge the gap between parametric and nonparametric
statistics and to exploit the best aspects of the former while enjoying the robustness
properties of the latter. Parametric statistics is a well-established field which incorpo-
rates the important notions of likelihood and sufficiency that are part of estimation and
testing. Likelihood methods have been used to construct efficient estimators, confidence
intervals, and tests with good power properties. They have also been used to incorporate
incomplete data and to pool information from different sources collected under different
sampling schemes. As well, Bayesian methods which rely on the likelihood function
can be used to combine information acquired through a prior distribution. Constraints
which restrict the domain of the likelihood function can also be taken into account.
Likelihood functions are Bartlett correctable, which helps to improve the accuracy of the inference. Additional tools such as penalized likelihood via the Akaike or Bayesian information criteria can take into account constraints on the parameters. Recently,
the notion of composite likelihood has been introduced to extend the range of appli-
cations. Problems of model selection are naturally dealt with through the likelihood
function.
A difficulty that arises with parametric inference is that we need to know the
underlying distribution of the random variable up to some parameters. If that dis-
tribution is misspecified, then inferences based on the likelihood may be inefficient
and confidence intervals and tests may not lead to correct conclusions. Hence, when
there is uncertainty as to the exact nature of the distribution, one may alternatively
make use of traditional nonparametric methods which avoid distributional assump-
tions. Such methods have proven to be very efficient in several instances although
their power is generally less than that of the analogous parametric counterparts. Moreover, nonparametric statistics relies more on intuition, and the subject has developed in a nonsystematic manner, always in an attempt to mimic parametric statistics. Bootstrap methods could also be used in most cases to provide estimates and to construct confidence intervals, but their interpretation may not be easy. For example, the
shape of a confidence region for a vector parameter will often appear as a cloud in
space.
To act as a bridge between parametric and nonparametric statistics, Conover and
Iman (1981) used rank transformations in an ad hoc manner. They suggested using
parametric methods based on the ranks of the data in order to conduct nonparametric
analyses. However, as mentioned by Conover and Iman (1981), such an approach has a
number of limitations, for instance the severe lack of robustness for the test for equality
of variances.
In a landmark paper, Neyman (1937) considered the nonparametric goodness of fit
problem and introduced the notion of smooth tests of fit by proposing a parametric
family of alternative densities to the null hypothesis. The type of embedding proposed
by Neyman was further elaborated by Rayner et al. (2009a) in connection with good-
ness of fit problems. In this book, we propose an embedding which focuses on local
properties more in line with the notion of exponential tilting. Hence, we obtain a new
derivation of the well-known Friedman statistic as the locally most powerful test in an
embedded family of distributions. In another direction, we exploit Hoeffding’s change
of measure formula which provides an approach to obtaining locally most powerful tests
based on ranks for various multi-sample problems. This is then followed by applications
of Le Cam’s three lemmas in order to obtain the asymptotic distribution of various statis-
tics under the alternative. Together, these results enable us to determine the asymptotic
relative efficiency of our test statistics.
This book is divided into three parts. In Part I, we outline briefly fundamental
concepts in probability and statistics. We introduce the reader to some of the important
tools in nonparametric statistics such as U statistics and linear rank statistics. In Part II,
we describe Neyman’s smooth tests in connection with goodness of fit problems and we
obtain test statistics for some common nonparametric problems. We then proceed to
make use of this concept in connection with the usual one- and two-sample tests. In
Chapter 6, we present a unified theory of hypothesis testing and apply it to study multi-
sample problems of location. We illustrate the theory in the case of the multi-sample
location problem as well as the problem involving umbrella alternatives. In Chapter 7,
we obtain a new derivation of the Friedman statistic and show it is locally most powerful.
We then make use of penalized likelihood to gain further insight into the rankings selected
by the sample. Chapter 8 deals with locally most powerful tests, whereas Chapter 9 is
devoted to the concept of efficiency. In Part III, we consider some modern applications
of nonparametric statistics. Specifically, we couch the multiple change-point problem
within the context of a smooth alternative. Next, we propose a new Bayesian approach
to the study of ranking problems. We conclude with Chapter 12 wherein we briefly
describe the application of the methodology to the analysis of censored data.
2. Fundamental Concepts in Parametric Inference
In this chapter we review some terminology and basic concepts in probability and
classical statistical inference which provide the notation and fundamental background
to be used throughout this book. In the section on probability we describe some basic
notions and list some common distributions along with their mean, variance, skewness,
and kurtosis. We also describe various modes of convergence and end with central limit
theorems. In the section on statistical inference, we begin with the subjects of estima-
tion and hypothesis testing and proceed with the notions of contiguity and composite
likelihood.
Most random variables are either discrete or continuous. We say that a random
variable is continuous if its cdf is a continuous function having no jumps. A continuous
random variable, such as weight, length, or lifetime, takes any numerical value in an
interval or on the positive real line. Typically, a continuous cdf has a derivative except
at some points. This derivative, denoted by

$$f_X(x) = \frac{d}{dx} F_X(x) = F_X'(x),$$

is called the probability density function (pdf) of X. The pdf of a continuous random variable X on the entire real line satisfies

$$f_X(x) \ge 0, \qquad \int_{-\infty}^{\infty} f_X(t)\, dt = 1, \qquad \text{and} \qquad F_X(x) = \int_{-\infty}^{x} f_X(t)\, dt.$$

For jointly distributed random variables, the conditional density of X given Y = y is

$$f(x \mid y) = \frac{f(x, y)}{f_Y(y)}, \qquad f_Y(y) > 0.$$

The joint pdf of p random variables is obtained from the joint cdf as

$$f_{\mathbf{X}}(x_1, x_2, \ldots, x_p) = \frac{\partial^p F_{\mathbf{X}}(x_1, x_2, \ldots, x_p)}{\partial x_1 \cdots \partial x_p},$$

provided the multiple derivatives exist. The p random variables X1, X2, ..., Xp are said to be independent if their joint pdf is the product of the p individual densities, labeled marginal pdfs and denoted {f_{X_i}(x)}, i.e.,

$$f_{\mathbf{X}}(x_1, \ldots, x_p) = \prod_{i=1}^{p} f_{X_i}(x_i).$$
Definition 2.1. A random sample of size p of the random variable X is a set of inde-
pendent and identically distributed (i.i.d.) random variables X1 , X2 , . . . , Xp , with the
same pdf as X.
Definition 2.2. The order statistics from a sample of random variables X1 , X2 , . . . , Xp
are denoted X(1) ≤ X(2) ≤ . . . ≤ X(p) and indicate which are the smallest, second
smallest, etc.
Hence, X(1) = min {X1 , X2 , . . . , Xp } and X(p) = max {X1 , X2 , . . . , Xp } .
In probability and statistics, we are interested in properties of the distribution of a
random variable. The expected value of a function g(X) of a real valued random variable
X, denoted by E[g(X)], is defined as
$$E[g(X)] = \begin{cases} \displaystyle\int_{-\infty}^{\infty} g(x) f(x)\, dx & \text{if } X \text{ is continuous} \\[2mm] \displaystyle\sum_{x \in \Omega} g(x) f(x) & \text{if } X \text{ is discrete.} \end{cases}$$
Another important property is that the moment generating function when it exists is
unique. The moment generating function of the sum of independent random variables is
equal to the product of the individual moment generating functions. This fact, coupled
with the uniqueness property helps to determine the distribution of the sum of the
variables.
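For instance (a standard illustration added here), let X1, ..., Xn be independent Poisson random variables with means λ1, ..., λn. Then

$$M_{X_i}(t) = E\left[e^{tX_i}\right] = e^{\lambda_i\left(e^t - 1\right)}, \qquad M_{\sum_i X_i}(t) = \prod_{i=1}^{n} M_{X_i}(t) = e^{\left(\sum_i \lambda_i\right)\left(e^t - 1\right)},$$

and by the uniqueness property the sum $\sum_i X_i$ is therefore Poisson with mean $\sum_i \lambda_i$.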
The third and fourth central moments are used to define the skewness and (excess)
kurtosis as
$$\gamma = \mu_3/\sigma^3 \qquad \text{and} \qquad \kappa = \mu_4/\sigma^4 - 3,$$
respectively. The skewness measures the “slant” of a distribution with γ = 0 for a
symmetric distribution. When γ < 0, the distribution is slanted to the left (with a
long tail on the left) and when γ > 0, it is slanted to the right (with a long tail on
the right). The kurtosis measures the “fatness” of the tails of a distribution. A positive
value indicates a heavier tail than that of a normal distribution whereas a negative value
points to a lighter tail.
Knowledge of the mean, variance, skewness, and kurtosis can often be used to ap-
proximate fairly well a given distribution (see Kendall and Stuart (1979)). Table 2.1
exhibits some important pmfs/pdfs along with their mean, variance, skewness, and
(excess) kurtosis. The multinomial distribution generalizes the binomial for the case
of r categories.
Using the linear properties of the expectation operator, we can determine the ex-
pected value and moments of a linear combination of a set of random variables. For
instance, for p random variables X1 , . . . , Xp and constants ai , i = 1, . . . , p, we have,
$$E\left[\sum_{i=1}^{p} a_i X_i\right] = \sum_{i=1}^{p} a_i\, E[X_i],$$

$$Var\left(\sum_{i=1}^{p} a_i X_i\right) = \sum_{i=1}^{p} a_i^2\, Var(X_i) + 2\sum_{i=1}^{p}\sum_{j=i+1}^{p} a_i a_j\, Cov(X_i, X_j),$$

where

$$Cov(X_i, X_j) = E[X_i X_j] - E[X_i]\, E[X_j].$$
Some useful results in connection with conditional distributions are in the following
theorem.
Theorem 2.1. Let X and Y be two random variables defined on the same probability
space. Then
(a) $E[Y] = E\left[E[Y \mid X]\right]$.

Recall that the mean and variance of a uniform distribution on the interval (A, B) are

$$\frac{A+B}{2} \qquad \text{and} \qquad \frac{(B-A)^2}{12},$$

respectively.
Table 2.1.: Some important discrete and continuous random variables

Discrete

- Uniform on 1, 2, ..., m: pmf $\frac{1}{m}$; mean $\frac{m+1}{2}$; variance $\frac{m^2-1}{12}$; skewness 0; kurtosis $-\frac{6(m^2+1)}{5(m^2-1)}$
- Binomial: pmf $\binom{n}{x} p^x (1-p)^{n-x}$, $x = 0, 1, \ldots, n$, $0 \le p \le 1$; mean $np$; variance $np(1-p)$; skewness $\frac{1-2p}{\sqrt{np(1-p)}}$; kurtosis $\frac{1-6p(1-p)}{np(1-p)}$
- Multinomial: pmf $\frac{n!}{x_1! x_2! \cdots x_r!}\, p_1^{x_1} p_2^{x_2} \cdots p_r^{x_r}$, $\sum_{i=1}^{r} x_i = n$, $\sum_{i=1}^{r} p_i = 1$

Continuous

- Uniform on (a, b): pdf $\frac{1}{b-a}$, $a < x < b$; mean $\frac{a+b}{2}$; variance $\frac{(b-a)^2}{12}$; skewness 0; kurtosis $-\frac{6}{5}$
- Normal: pdf $\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$, $-\infty < x, \mu < \infty$, $\sigma > 0$; mean $\mu$; variance $\sigma^2$; skewness 0; kurtosis 0
- Exponential: pdf $\lambda e^{-\lambda x}$, $x \ge 0$, $\lambda > 0$; mean $\frac{1}{\lambda}$; variance $\frac{1}{\lambda^2}$; skewness 2; kurtosis 6
- Gamma: pdf $\frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}$, $x \ge 0$, $\alpha, \beta > 0$; mean $\frac{\alpha}{\beta}$; variance $\frac{\alpha}{\beta^2}$; skewness $\frac{2}{\sqrt{\alpha}}$; kurtosis $\frac{6}{\alpha}$
- Chi-square: pdf $\frac{1}{2^{m/2}\Gamma(m/2)}\, x^{m/2-1} e^{-x/2}$, $x \ge 0$, $m > 0$; mean $m$; variance $2m$; skewness $\sqrt{\frac{8}{m}}$; kurtosis $\frac{12}{m}$
- Laplace: pdf $\frac{1}{2\sigma} \exp\left(-\frac{|x-\mu|}{\sigma}\right)$, $-\infty < x, \mu < \infty$, $\sigma > 0$; mean $\mu$; variance $2\sigma^2$; skewness 0; kurtosis 3
- Logistic: pdf $\frac{\exp\left(-\frac{x-\mu}{\sigma}\right)}{\sigma\left(1+\exp\left(-\frac{x-\mu}{\sigma}\right)\right)^2}$, $-\infty < x, \mu < \infty$, $\sigma > 0$; mean $\mu$; variance $\frac{\pi^2\sigma^2}{3}$; skewness 0; kurtosis 1.2
- Student's t: pdf $\frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\pi\nu}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1+\frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}$, $-\infty < x < \infty$, $\nu > 0$; mean 0 ($\nu > 1$); variance $\frac{\nu}{\nu-2}$ ($\nu > 2$); skewness 0; kurtosis $\frac{6}{\nu-4}$ ($\nu > 4$)
We shall also need to make use of the distribution of the order statistics, which is given in the lemma below.

Lemma 2.1. Let X(1) ≤ X(2) ≤ ... ≤ X(n) be the order statistics of a random sample of size n from a continuous distribution with cdf F and density f. Then the density of X(i) is

$$f_{X_{(i)}}(x) = \frac{n!}{(i-1)!\,(n-i)!}\, \left[F(x)\right]^{i-1} f(x) \left[1 - F(x)\right]^{n-i}, \qquad i = 1, \ldots, n. \tag{2.1}$$

Proof. An intuitive proof may be given by using the multinomial distribution. The probability that X(i) lies in a small interval around x implies that there are (i − 1) observations to the left of x and (n − i) to the right of x.
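The density (2.1) is easy to check numerically. The following minimal sketch (our illustration, not the book's; numpy is assumed, and the uniform distribution, sample size, and index i are arbitrary choices) compares the simulated density of X(i) with formula (2.1) when F is uniform on (0, 1), so that F(x) = x and f(x) = 1.

```python
# Compare the empirical density of the i-th order statistic with formula (2.1)
# for F uniform on (0, 1). Illustrative sketch; n, i and the grid are arbitrary.
import math
import numpy as np

rng = np.random.default_rng(0)
n, i, reps = 10, 3, 200_000
samples = np.sort(rng.uniform(size=(reps, n)), axis=1)[:, i - 1]   # draws of X_(i)

x = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
const = math.factorial(n) / (math.factorial(i - 1) * math.factorial(n - i))
pdf = const * x ** (i - 1) * (1 - x) ** (n - i)     # formula (2.1) with F(x)=x, f(x)=1

h = 0.01                                            # small bin width
emp = np.array([np.mean(np.abs(samples - xx) < h / 2) / h for xx in x])
print(np.round(pdf, 2))
print(np.round(emp, 2))    # the two rows agree closely
```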
Lemma 2.2 (Borel-Cantelli). Let {An} be a sequence of events for which $\sum_{n=1}^{\infty} P(A_n) < \infty$. Then

$$P\left(\limsup_{n \to \infty} A_n\right) = P\left(\bigcap_{n=1}^{\infty} \bigcup_{k \ge n} A_k\right) = 0.$$

The notation $\limsup_{n \to \infty} A_n$ indicates the set of outcomes which occur infinitely often in the sequence of events.
We shall say that a sequence of random variables Xn converges in distribution to X, denoted $X_n \xrightarrow{L} X$, if at every continuity point x of the cdf of X,

$$P(X_n \le x) \to P(X \le x).$$

We shall say that Xn converges in probability to X, denoted $X_n \xrightarrow{P} X$, if as $n \to \infty$

$$P\left(|X_n - X| > \varepsilon\right) \to 0$$

for every ε > 0.

We shall say that a sequence of random variables Xn converges almost surely to X, denoted $X_n \xrightarrow{a.s.} X$, if as $n \to \infty$

$$P\left(\lim_{n \to \infty} |X_n - X| > \varepsilon\right) = 0$$

for every ε > 0.
Convergence almost surely implies convergence in probability. On the other hand, if
a sequence of random variables converges in probability, then there exists a subsequence
which converges almost surely. As well, convergence in probability implies convergence
in distribution. The following inequality plays a useful role in probability and statistics.
Lemma 2.3 (Chebyshev Inequality). Let X be a random variable with mean μ and finite variance σ². Then for ε > 0,

$$P\left(|X - \mu| \ge \varepsilon\right) \le \frac{\sigma^2}{\varepsilon^2}.$$
As an application of Chebyshev's inequality, suppose that X1, ..., Xn is a sequence of independent identically distributed random variables having mean μ and finite variance σ². Then, for ε > 0,

$$P\left(\left|\bar{X}_n - \mu\right| > \varepsilon\right) \le \frac{\sigma^2}{n\varepsilon^2},$$

from which we conclude that $\bar{X}_n \xrightarrow{P} \mu$ as n → ∞. This is known as the Weak Law of Large Numbers.
There is as well the Strong Law of Large Numbers (Billingsley (2012), Section 22, p. 301) which states that if X1, ..., Xn is a sequence of independent identically distributed random variables with mean μ for which E|Xi| < ∞, then for ε > 0,

$$P\left(\lim_{n \to \infty}\left|\bar{X}_n - \mu\right| > \varepsilon\right) = 0.$$

That is, $\bar{X}_n \xrightarrow{a.s.} \mu$. The single most important result in both probability and statistics is the central limit theorem (CLT). In its simplest version, it states that if {X1, ..., Xn} is a sequence of independent identically distributed (i.i.d.) random variables with mean μ and finite variance σ², then for large enough n,

$$\frac{\sqrt{n}\left(\bar{X}_n - \mu\right)}{\sigma} \xrightarrow{L} N(0, 1),$$
where N (0, 1) is the standard normal distribution with mean 0 and variance 1. Since the
assumptions underlying the CLT are weak, there have been countless applications, for example approximations to the probabilities of various events involving the sample mean. As well, it has been used to approximate various discrete distributions such as
the binomial, Poisson, and negative binomial. An important companion result is due to
Slutsky (Casella and Berger (2002), p. 239) which can often be used together with the
central limit theorem.
Theorem 2.2 (Slutsky's Theorem). Suppose that $X_n \xrightarrow{L} X$ and $Y_n \xrightarrow{P} c$ for a constant c. Then

$$X_n Y_n \xrightarrow{L} cX$$

and

$$X_n + Y_n \xrightarrow{L} X + c.$$

Moreover, if c ≠ 0,

$$X_n / Y_n \xrightarrow{L} X / c.$$
A direct application of Slutsky's theorem is as follows. Let {X1, ..., Xn} be a sequence of i.i.d. random variables having finite variance σ², n > 1, and let

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad S_n^2 = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}_n\right)^2}{n-1}$$

be the sample mean and sample variance respectively. Then, it can be shown that

$$Var\left(S_n^2\right) = \frac{1}{n}\left(\mu_4 - \frac{n-3}{n-1}\,\sigma^4\right) = \frac{\sigma^4}{n}\left(\kappa + \frac{2n}{n-1}\right),$$

which tends to 0 as n → ∞, so that

$$S_n^2 \xrightarrow{P} \sigma^2.$$
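A classical consequence is that, by Slutsky's theorem, σ may be replaced by Sn in the CLT, so that $\sqrt{n}(\bar{X}_n - \mu)/S_n \xrightarrow{L} N(0,1)$. The short simulation below (our illustration; the exponential data, sample size and seed are arbitrary choices, and numpy is assumed) makes this visible.

```python
# Simulation of the studentized mean: sqrt(n)*(Xbar - mu)/S_n should be
# approximately N(0, 1) by the CLT combined with Slutsky's theorem.
import numpy as np

rng = np.random.default_rng(1)
n, reps, mu = 200, 50_000, 1.0              # Exponential(1) has mean 1, sd 1
x = rng.exponential(mu, size=(reps, n))
t = np.sqrt(n) * (x.mean(axis=1) - mu) / x.std(axis=1, ddof=1)

# Tail frequencies should be close to the standard normal value 0.025.
print(np.mean(t > 1.96), np.mean(t < -1.96))
```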
An additional result known as the Delta method enables us to extend the CLT to functions (Casella and Berger (2002), p. 243).

Theorem 2.3 (Delta Method). Let {Yn} be a sequence of random variables having mean θ and finite variance σ² and for which

$$\sqrt{n}\,(Y_n - \theta) \xrightarrow{L} N\left(0, \sigma^2\right), \qquad \text{as } n \to \infty.$$

Then for any function g which is differentiable at θ with g′(θ) ≠ 0,

$$\sqrt{n}\left(g(Y_n) - g(\theta)\right) \xrightarrow{L} N\left(0, \sigma^2\left[g'(\theta)\right]^2\right).$$
The proof can be obtained by applying the first order Taylor expansion to g (Yn ).
Example 2.2. Let {X1, ..., Xn} be a random sample from the Bernoulli distribution for which Xi = 1 with probability θ and Xi = 0 with probability 1 − θ. Let

$$g(\theta) = \theta(1 - \theta).$$

The CLT asserts that $\frac{\sqrt{n}\left(\bar{X}_n - \theta\right)}{\sqrt{\theta(1-\theta)}} \xrightarrow{L} N(0,1)$ as n → ∞. Then using the Delta method, the asymptotic distribution of $g\left(\bar{X}_n\right)$ is given by

$$\sqrt{n}\left(g\left(\bar{X}_n\right) - g(\theta)\right) \xrightarrow{L} N\left(0, \theta(1-\theta)\left[g'(\theta)\right]^2\right)$$

provided g′(θ) ≠ 0. We note, however, that at θ = 1/2, g′(1/2) = 0. In order to determine the asymptotic distribution of $g\left(\bar{X}_n\right)$ at θ = 1/2, we may proceed by using a second-order Taylor expansion:

$$g\left(\bar{X}_n\right) = g\left(\tfrac{1}{2}\right) + g'\left(\tfrac{1}{2}\right)\left(\bar{X}_n - \tfrac{1}{2}\right) + \tfrac{1}{2}\, g''\left(\tfrac{1}{2}\right)\left(\bar{X}_n - \tfrac{1}{2}\right)^2 = \tfrac{1}{4} + 0 - \left(\bar{X}_n - \tfrac{1}{2}\right)^2,$$

so that $n\left(g\left(\bar{X}_n\right) - \tfrac{1}{4}\right) = -n\left(\bar{X}_n - \tfrac{1}{2}\right)^2 \xrightarrow{L} -\tfrac{1}{4}\chi_1^2$.
Theorem 2.4 (Lindeberg-Feller). Suppose that {Xi} are independent random variables with means {μi}, finite variances {σi²}, and distribution functions {Fi}. Let $S_n = \sum_{i=1}^{n}(X_i - \mu_i)$, $B_n^2 = \sum_{i=1}^{n}\sigma_i^2$, and suppose that

$$\max_k \frac{\sigma_k^2}{B_n^2} \to 0 \qquad \text{as } n \to \infty.$$

Then $S_n / B_n \xrightarrow{L} N(0, 1)$ if and only if the Lindeberg condition holds: for all ε > 0,

$$\frac{1}{B_n^2}\sum_{i=1}^{n} E\left[(X_i - \mu_i)^2\, I\left(|X_i - \mu_i| > \varepsilon B_n\right)\right] \to 0 \qquad \text{as } n \to \infty. \tag{2.2}$$

The Lindeberg condition (2.2) is implied by the Lyapunov condition, which states that there exists a δ > 0 such that

$$\frac{\sum_{i=1}^{n} E|X_i - \mu_i|^{2+\delta}}{B_n^{2+\delta}} \to 0 \qquad \text{as } n \to \infty.$$

Remark 2.1. Note that the condition $\max_k \sigma_k^2 / B_n^2 \to 0$ as n → ∞ is not needed for the proof of the "if" part as it can be derived from (2.2).
Example 2.3. (Lehmann (1975), p. 351) Let {Y1, ..., Yn} be a random sample from the Bernoulli distribution for which Yi = 1 with probability θ and Yi = 0 with probability 1 − θ. Set Xi = iYi. We would like to determine the asymptotic distribution of $\sum_{i=1}^{n} X_i$. We note that μi = iθ and σi² = i²θ(1 − θ). Consequently,

$$B_n^2 = \sum_{i=1}^{n} \sigma_i^2 = \frac{n(n+1)(2n+1)}{6}\, \theta(1 - \theta).$$

Since |Xi − μi| ≤ n for all i ≤ n,

$$(X_i - \mu_i)^2 < n^2,$$

whereas Bn² grows at the rate n³, and consequently,

$$\lim_{n \to \infty} \frac{\sum_{i=1}^{n} E\left[(X_i - \mu_i)^2\, I\left(|X_i - \mu_i| > \varepsilon B_n\right)\right]}{B_n^2} = 0.$$

Therefore, applying Theorem 2.4, $B_n^{-1}\left(\sum_{i=1}^{n} X_i - \theta n(n+1)/2\right) \xrightarrow{L} N(0, 1)$ for large n. We note that in this example, the Lyapunov condition would not be satisfied.
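A quick simulation of this example (our illustration; θ, n and the seed are arbitrary choices, and numpy is assumed) confirms the normal limit.

```python
# Simulation of Example 2.3: with X_i = i*Y_i, the standardized sum
# (sum X_i - theta*n(n+1)/2)/B_n should be approximately N(0, 1).
import numpy as np

rng = np.random.default_rng(3)
n, reps, theta = 400, 50_000, 0.3
i = np.arange(1, n + 1)
y = rng.binomial(1, theta, size=(reps, n))
s = (i * y).sum(axis=1)

bn = np.sqrt(theta * (1 - theta) * n * (n + 1) * (2 * n + 1) / 6)
z = (s - theta * n * (n + 1) / 2) / bn
print(z.mean(), z.var())       # approximately 0 and 1
print(np.mean(z > 1.645))      # approximately 0.05
```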
Theorem 2.5. Let Y ∼ Np(μ, Σ) and let Z = AY + b, where A is a q × p matrix of constants and b is a q × 1 vector. Then

$$Z \sim N_q\left(A\mu + b,\; A\Sigma A^\top\right).$$
Example 2.4 (Distribution of Quadratic Forms). We cite in this example some well-known results on quadratic forms in normal variates. Let Y ∼ Np(μ, Σ) where Σ is positive definite. Then,

(c) $Y^\top A Y \sim \chi_r^2(\delta)$ if and only if AΣA = A, that is, AΣ is idempotent and rank A = r. Here, A is a symmetric p × p positive definite matrix of constants, $\chi_r^2(\delta)$ is the noncentral chi-square distribution, and $\delta = \mu^\top A \mu$ is the noncentrality parameter.
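As a numerical check (our illustration; the covariance matrix is a made-up example, and numpy/scipy are assumed), take μ = 0 and A = Σ⁻¹, so that AΣ = I is idempotent of rank p and YᵀAY should be central chi-square with p degrees of freedom.

```python
# Check of result (c) with mu = 0 and A = Sigma^{-1}: Y'AY ~ chi-square(2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
a = np.linalg.inv(sigma)                        # A*Sigma = I, idempotent, rank 2
y = rng.multivariate_normal(np.zeros(2), sigma, size=100_000)
q = np.einsum('ij,jk,ik->i', y, a, y)           # quadratic forms Y'AY

print(np.quantile(q, [0.5, 0.95]))
print(stats.chi2(2).ppf([0.5, 0.95]))           # should match closely
```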
Most statistical inference is concerned with using the sample to obtain knowledge about
the parameter θ. It sometimes happens that a function of the sample, labeled a statistic,
provides a summary of the data which is most relevant for this purpose.
Definition 2.5. We say that a statistic T (X1 , . . . , Xn ) is sufficient for θ if the condi-
tional density of X1 , . . . , Xn given T (X1 , . . . , Xn ) is independent of θ.
The factorization theorem characterizes this concept in the sense that T is sufficient if and only if there exist a function h of t = T(x1, ..., xn) and θ only and a function g such that

$$f(x_1, \ldots, x_n; \theta) = h(t, \theta)\, g(x_1, \ldots, x_n).$$
The concept of sufficiency allows us to focus attention on that function of the data which
contains all the important information on θ.
There are some desirable properties of estimators which provide guidance on how
to choose among them. An estimator T (X1 , . . . , Xn ) is said to be unbiased for the
estimation of g (θ) if for all θ
Eθ [T (X1 , . . . , Xn )] = g (θ) .
Example 2.5. Let {X1, ..., Xn} be a random sample drawn from a distribution with population mean μ and variance σ². It is easy to see from the properties of the expectation operator that the sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ is unbiased for μ:

$$E\left[\bar{X}_n\right] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}\sum_{i=1}^{n} \mu = \mu.$$

Also, the sample variance $S_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}_n\right)^2$ is unbiased for σ², since Sn² can be reexpressed as

$$S_n^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n}(X_i - \mu)^2 - n\left(\bar{X}_n - \mu\right)^2\right],$$

and

$$E\left[(X_i - \mu)^2\right] = Var(X_i) = \sigma^2, \qquad E\left[\left(\bar{X}_n - \mu\right)^2\right] = Var\left(\bar{X}_n\right) = \frac{\sigma^2}{n}.$$
A desirable property of an unbiased estimator is that it should have the smallest variance

$$E_\theta\left[\left(T(X_1, \ldots, X_n) - g(\theta)\right)^2\right].$$
A further desirable property of estimators is that of consistency.
A family of densities is said to form an exponential family if it can be written in the form

$$f(x; \theta) = h(x)\exp\left\{\eta(\theta)\, t(x) - K(\theta)\right\}, \tag{2.4}$$

where h(x), η(θ), t(x), K(θ) are known functions. We will assume for simplicity that η(θ) = θ. It follows that

$$E_\theta[t(X)] = K'(\theta), \qquad Var_\theta(t(X)) = K''(\theta).$$

The density f(x; θ) is sometimes called the exponential tilting of h(x), with mean K′(θ) and variance K″(θ). The exponential family includes as special cases several of the commonly used distributions, such as the normal, exponential, binomial, and Poisson.
Suppose that we have a random sample from the exponential family. It follows that the joint density is given by

$$f(x_1, \ldots, x_n; \theta) = h(x_1, \ldots, x_n)\exp\left\{\theta\sum_{i=1}^{n} t(x_i) - nK(\theta)\right\},$$

which can again be identified as being a member of the exponential family. An important result in the context of estimation is the following.
Theorem 2.6 (Cramér-Rao Inequality). Suppose that {X1, ..., Xn} is a random sample from a distribution having density f(x; θ). Under certain regularity conditions which permit the exchange of the order of differentiation and integration, the variance of any estimator T(X1, ..., Xn) with $b(\theta) = E_\theta[T(X_1, \ldots, X_n)]$ is bounded below by

$$Var_\theta\left(T(X_1, \ldots, X_n)\right) \ge \frac{\left[b'(\theta)\right]^2}{I(\theta)}, \tag{2.5}$$

where

$$I(\theta) = E_\theta\left[\left(\frac{\partial \log f(X_1, \ldots, X_n; \theta)}{\partial \theta}\right)^2\right] = -E_\theta\left[\frac{\partial^2 \log f(X_1, \ldots, X_n; \theta)}{\partial \theta^2}\right] < \infty. \tag{2.6}$$

The expression in (2.6) is known as the Fisher information and it plays a key role in estimation and hypothesis testing. The regularity conditions are satisfied by members of the exponential family.
Example 2.6. Suppose we have a random sample from the normal distribution with mean μ and variance σ². Then, X̄n is a consistent and unbiased estimator of the population mean μ whose variance is σ²/n. The Fisher information can be calculated to be nσ⁻² and hence the Cramér-Rao lower bound for the variance of any unbiased estimator is σ²/n. Consequently, the sample mean has the smallest variance among all unbiased estimators.
Suppose now that θ is a vector of parameters and let

$$b(\theta) = E_\theta\left[T(X_1, \ldots, X_n)\right].$$

Then under certain regularity conditions, the multi-parameter Cramér-Rao lower bound states that in matrix notation

$$Cov_\theta\left(T(X_1, \ldots, X_n)\right) \ge \frac{\partial b(\theta)}{\partial \theta}\left[I(\theta)\right]^{-1}\left(\frac{\partial b(\theta)}{\partial \theta}\right)^\top. \tag{2.7}$$
The matrix inequality above of the form A ≥ B is interpreted to mean that the difference
A−B is positive semi-definite. In general the regularity conditions require the existence
of the Fisher information and demand that either the density function has bounded
support and the bounds do not depend on θ or the density has infinite support, is
continuously differentiable and its support is independent of θ.
Example 2.7. Suppose that in Example 2.6 we would like to estimate θ = (μ, σ²)ᵀ where both the mean μ and the variance σ² are unknown. Let

$$T(X_1, \ldots, X_n) = \left(\bar{X}_n, S_n^2\right)^\top$$

be the vector of the sample mean and sample variance respectively. We have

$$I(\theta)^{-1} = \begin{pmatrix} \dfrac{\sigma^2}{n} & 0 \\[2mm] 0 & \dfrac{2\sigma^4}{n} \end{pmatrix}.$$

Consequently, we conclude that X̄n attains the Cramér-Rao lower bound whereas Sn² does not.
By far the most popular method for finding estimators is the method of maximum likelihood developed by R.A. Fisher. For a random sample with joint density f(x1, ..., xn; θ), the likelihood function is $L(\theta; \mathbf{x}) = f(x_1, \ldots, x_n; \theta)$, regarded as a function of θ, and the maximum likelihood estimator is the value of θ which maximizes it. Provided the derivatives exist, the maximization of the likelihood may sometimes be done by maximizing instead the logarithm of the likelihood, since the logarithm is a strictly increasing function.
Example 2.8. Let {X1, ..., Xn} be a random sample from a normal distribution with mean μ and variance σ². The log likelihood function is given by

$$\log L\left(\mu, \sigma^2\right) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log \sigma^2 - \sum_{i=1}^{n}\frac{(x_i - \mu)^2}{2\sigma^2}.$$

The maximum likelihood equations are then

$$\frac{\partial \log L(\mu, \sigma^2)}{\partial \mu} = \sum_{i=1}^{n}\frac{(x_i - \mu)}{\sigma^2} = 0,$$

$$\frac{\partial \log L(\mu, \sigma^2)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \sum_{i=1}^{n}\frac{(x_i - \mu)^2}{2\sigma^4} = 0,$$

from which we obtain the maximum likelihood estimators $\bar{X}_n$ and $(n-1)S_n^2/n$ for the mean and variance respectively.
Example 2.9. For a random sample {Y1, ..., Yn} from the multivariate normal distribution Np(μ, Σ), similar calculations yield the maximum likelihood estimators

$$\bar{\mathbf{Y}}_n = \frac{1}{n}\sum_{i=1}^{n}\mathbf{Y}_i, \qquad \hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n}\left(\mathbf{Y}_i - \bar{\mathbf{Y}}_n\right)\left(\mathbf{Y}_i - \bar{\mathbf{Y}}_n\right)^\top.$$
Example 2.10. Suppose we have a random sample {X1, ..., Xn} from a uniform distribution on the interval (0, θ). Since the range of the density depends on θ we cannot take the derivative of the likelihood function. Instead we note that the likelihood function is given by

$$L(\theta) = \begin{cases} \theta^{-n} & \max_{1 \le i \le n} X_i < \theta \\ 0 & \text{elsewhere,} \end{cases}$$

and hence the maximum likelihood estimator of θ is $\max_{1 \le i \le n} X_i$.
Consider n independent trials, each of which results in one of M categories with respective probabilities p1, ..., pM, and let Xk denote the M × 1 vector indicating the category of the kth trial. The vector of counts $N = (N_1, \ldots, N_M)^\top = \sum_{k=1}^{n} X_k$ has probability function

$$P(N_j = n_j,\; j = 1, \ldots, M) = \frac{n!}{n_1! \cdots n_M!}\, p_1^{n_1} \cdots p_M^{n_M}, \qquad \sum_{j=1}^{M} p_j = 1, \quad \sum_{j=1}^{M} n_j = n,$$

and it is called the multinomial distribution. The {Xk} are i.i.d. with covariance matrix having (i, j) entry σij given by

$$\sigma_{ij} = \begin{cases} p_i(1 - p_i) & i = j \\ -p_i p_j & i \ne j. \end{cases}$$

Also, the covariance matrix of N is not of full rank and is given in matrix notation by nΣ, where $\Sigma = \operatorname{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^\top$ and $\mathbf{p} = (p_1, \ldots, p_M)^\top$. By the multivariate central limit theorem,

$$\frac{1}{\sqrt{n}}\left(N - n\mathbf{p}\right) \xrightarrow{L} N(0, \Sigma).$$
We turn now to hypothesis testing. Suppose that we wish to test

$$H_0: \theta \in \Theta_0 \qquad \text{against} \qquad H_1: \theta \in \Theta_1,$$
where Θ0 and Θ1 are subsets of the parameter space. In the situation where both Θ0 , Θ1
consist of single points θ0 and θ1 respectively, the Neyman-Pearson lemma provides
an optimal solution. Set x = (x1 , . . . , xn ) and let φ (x) be a critical function which
represents the probability of rejecting the null hypothesis when x is observed. Also let
α denote the prescribed probability of rejecting the null hypothesis when it is true (also
known as the size of the test).
Lemma 2.4 (Neyman-Pearson Lemma). Suppose that there exists some dominating measure μ with respect to which we have densities f(x; θ0) and f(x; θ1). Then the most powerful test of H0: θ = θ0 against H1: θ = θ1 is given by

$$\phi(\mathbf{x}) = \begin{cases} 1 & \text{if } \prod_{i=1}^{n} f(x_i; \theta_1) \Big/ \prod_{i=1}^{n} f(x_i; \theta_0) > k \\ \gamma & \text{if } \prod_{i=1}^{n} f(x_i; \theta_1) \Big/ \prod_{i=1}^{n} f(x_i; \theta_0) = k \\ 0 & \text{if } \prod_{i=1}^{n} f(x_i; \theta_1) \Big/ \prod_{i=1}^{n} f(x_i; \theta_0) < k, \end{cases}$$

where k is chosen such that $E_{\theta_0}[\phi(\mathbf{X})] = \alpha$. The power function of the test φ is defined to be

$$E_\theta[\phi(\mathbf{X})] = \int \phi(\mathbf{x})\prod_{i=1}^{n} f(x_i; \theta)\, dx_1 \cdots dx_n,$$

and it represents the probability of rejecting the null hypothesis for a given θ.
Example 2.12. Given a random sample {X1 , . . . , Xn } from a normal distribution with
unknown mean μ and known variance σ 2 , suppose that we are interested in testing the
null hypothesis
H0 : μ = μ0
against the alternative hypothesis
H1 : μ = μ1 > μ0 .
Then it can be shown that the uniformly most powerful test is given by

$$\phi(\mathbf{x}) = \begin{cases} 1 & \bar{X}_n > k \\ \gamma & \bar{X}_n = k \\ 0 & \bar{X}_n < k, \end{cases}$$

where $k = \mu_0 + z_\alpha \frac{\sigma}{\sqrt{n}}$, and zα is the upper α-point of the standard normal distribution.
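The power function of this test is easy to compute. The sketch below is our illustration (the values of μ0, σ, n and α are arbitrary; numpy and scipy are assumed).

```python
# Power function of the one-sided test of Example 2.12 with known sigma:
# reject when Xbar > k, where k = mu0 + z_alpha * sigma / sqrt(n).
import numpy as np
from scipy import stats

mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.05
z_a = stats.norm.ppf(1 - alpha)            # upper alpha-point
k = mu0 + z_a * sigma / np.sqrt(n)         # critical value for Xbar

def power(mu):
    # P(Xbar > k) when Xbar ~ N(mu, sigma^2/n)
    return 1 - stats.norm.cdf(k, loc=mu, scale=sigma / np.sqrt(n))

print(power(mu0))                          # equals alpha = 0.05
print(power(0.5))                          # power at mu1 = 0.5
```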
Example 2.13. In the case of a random sample from an exponential distribution with
mean θ, describing the lifetimes of light bulbs, the uniformly most powerful test of
H 0 : θ = θ0
H 1 : θ = θ1 < θ0 .
is given by

$$\phi(\mathbf{x}) = \begin{cases} 1 & \bar{X}_n > k \\ \gamma & \bar{X}_n = k \\ 0 & \bar{X}_n < k, \end{cases}$$

where k is a solution of $\Gamma(n, nk/\theta_0) = \alpha\,(n-1)!$. Here, Γ(a, b) is the upper incomplete gamma function defined as $\int_b^\infty u^{a-1} e^{-u}\, du$.
Uniformly most powerful (UMP) tests rarely exist in practice. An exception occurs when the family of densities possesses a monotone likelihood ratio.

Definition 2.9. We shall say that a family of densities {f(x; θ), θ ∈ Θ} has monotone likelihood ratio if the ratio $\frac{f(x; \theta_2)}{f(x; \theta_1)}$ is nondecreasing in some function T(x) for all θ1 < θ2 in some interval Θ.
For the exponential family (2.4),

$$\frac{f(x; \theta_2)}{f(x; \theta_1)} = \exp\left\{(\theta_2 - \theta_1)\, t(x) - \left(K(\theta_2) - K(\theta_1)\right)\right\},$$

so that the family has monotone likelihood ratio in t(x). For example, for a normal mean consider testing

$$H_0: \mu = \mu_0 \qquad \text{against} \qquad H_1: \mu > \mu_0.$$

The family has monotone likelihood ratio in x, and the uniformly most powerful test of H0: μ = μ0 against H1: μ = μ1 > μ0 does not depend on the particular value of μ1; it is therefore uniformly most powerful against H1: μ > μ0.
For the general testing problem, let Λn denote the likelihood ratio statistic

$$\Lambda_n = \frac{\sup_{\theta \in \Theta_0} L(\theta; \mathbf{x})}{\sup_{\theta \in \Theta} L(\theta; \mathbf{x})}.$$

The likelihood ratio test rejects the null hypothesis whenever Λn is small enough. The factorization theorem shows that the likelihood ratio Λn is based on the sufficient statistic and, moreover, it is invariant under transformations of the parameter space that leave the null and alternative hypotheses invariant. For a random sample of size n from the exponential family (2.4), Λn is a function of $\bar{t}_n = \sum_{i=1}^{n} t(x_i)/n$.
In certain situations, as in the case of a Cauchy location family, a uniformly most powerful test may not exist. On the other hand, a locally most powerful test, which maximizes the power function in a neighborhood of θ0, may exist. Provided we may differentiate the power function under the integral sign with respect to θ, we see upon using the generalized Neyman-Pearson lemma (see Ferguson (1967), p. 204) that a test of the form

$$\phi(\mathbf{x}) = \begin{cases} 1 & \text{if } \frac{\partial}{\partial\theta}\prod_{i=1}^{n} f(x_i; \theta_0) > k\prod_{i=1}^{n} f(x_i; \theta_0) \\ \gamma & \text{if } \frac{\partial}{\partial\theta}\prod_{i=1}^{n} f(x_i; \theta_0) = k\prod_{i=1}^{n} f(x_i; \theta_0) \\ 0 & \text{if } \frac{\partial}{\partial\theta}\prod_{i=1}^{n} f(x_i; \theta_0) < k\prod_{i=1}^{n} f(x_i; \theta_0) \end{cases}$$

is locally most powerful.
Definition 2.10. Let {X1, ..., Xn} be a random sample from some distribution having density f(x; θ), θ ∈ Rᵖ. Let L(θ; x) be the likelihood function, where x = (x1, ..., xn)ᵀ, and let

$$\ell(\theta; \mathbf{x}) = \log L(\theta; \mathbf{x}).$$

The derivative

$$U(\theta; \mathbf{x}) = \frac{\partial \ell(\theta; \mathbf{x})}{\partial \theta}$$

is called the score function.

The locally most powerful test can be seen to be equivalently based on the score function since

$$U(\theta_0; \mathbf{x}) = \frac{\partial}{\partial \theta}\log\prod_{i=1}^{n} f(x_i; \theta)\Bigg|_{\theta = \theta_0} = \frac{\frac{\partial}{\partial \theta}\prod_{i=1}^{n} f(x_i; \theta_0)}{\prod_{i=1}^{n} f(x_i; \theta_0)}.$$
Differentiating the identity $\int f(\mathbf{x}; \theta)\, d\mathbf{x} = 1$ twice with respect to the components of θ and exchanging the order of differentiation and integration gives

$$0 = \int \frac{\partial^2 \ell(\theta; \mathbf{x})}{\partial \theta_i\, \partial \theta_j}\, f(\mathbf{x}; \theta)\, d\mathbf{x} + \int \frac{\partial \ell(\theta; \mathbf{x})}{\partial \theta_i}\, \frac{\partial \ell(\theta; \mathbf{x})}{\partial \theta_j}\, f(\mathbf{x}; \theta)\, d\mathbf{x},$$

which shows that the two expressions for the Fisher information coincide.
For any hypothesis testing problem, there are three distinct possible test statistics: the
likelihood ratio test, the Wald test, and the Rao score test, all of which are asymptotically
equivalent as the sample size gets large. In the lemma below, we outline the proof for
the asymptotic distribution of the Rao score test.
Lemma 2.6 (Score Test). Let {X1, ..., Xn} be a random sample from a continuous distribution having density f(x; θ), θ ∈ Rᵖ, and suppose we wish to test H0: θ ∈ Θ0. Then the score statistic

$$U\left(\hat{\theta}_0; \mathbf{X}\right)^\top\left[I\left(\hat{\theta}_0\right)\right]^{-1} U\left(\hat{\theta}_0; \mathbf{X}\right),$$

where $\hat{\theta}_0$ is the maximum likelihood estimator of θ computed under the null hypothesis, has, as the sample size gets large, under the null hypothesis a $\chi_k^2$ distribution where k is the number of constraints imposed by the null hypothesis.
Proof. The result follows from the multivariate central limit theorem since the score is a sum of independent identically distributed random vectors:

$$U(\theta; \mathbf{X}) = \frac{\partial}{\partial \theta}\log\prod_{i=1}^{n} f(X_i; \theta) = \sum_{i=1}^{n}\frac{\partial}{\partial \theta}\log f(X_i; \theta).$$
Theorem 2.7 (The Three Amigos). Let X = {X1, ..., Xn} be a random sample from a continuous distribution having density f(x; θ), θ ∈ Θ, the parameter space. Suppose we are interested in testing the general null hypothesis H0: θ ∈ Θ0 against the alternative H1: θ ∈ Θ1 = Θ − Θ0. Then, the likelihood ratio test, the Wald test, and the score test all reject the null hypothesis for large values and are asymptotically distributed as central chi-square distributions with k degrees of freedom as n → ∞, where k is the number of constraints imposed by the null hypothesis. Specifically, the likelihood ratio statistic is $-2\log\left[L(\hat{\theta}_0; \mathbf{X})/L(\hat{\theta}; \mathbf{X})\right]$, the Wald statistic is $(\hat{\theta} - \hat{\theta}_0)^\top I(\hat{\theta})(\hat{\theta} - \hat{\theta}_0)$, and the score statistic is $U(\hat{\theta}_0; \mathbf{X})^\top[I(\hat{\theta}_0)]^{-1} U(\hat{\theta}_0; \mathbf{X})$, where $\hat{\theta}$ and $\hat{\theta}_0$ denote the maximum likelihood estimators computed without restriction and under the null hypothesis respectively.
Example 2.14. Let {X1, ..., Xn} be a random sample from the Poisson distribution with mean θ > 0 given by

$$f(x; \theta) = \frac{e^{-\theta}\theta^x}{x!}, \qquad x = 0, 1, \ldots, \; \theta > 0.$$
Suppose we wish to test

$$H_0: \theta = \theta_0.$$

The score function is $U(\theta; \mathbf{X}) = \sum_{i=1}^{n}\left(\frac{X_i}{\theta} - 1\right) = \frac{n(\bar{X} - \theta)}{\theta}$, whereas the Fisher information is given by $I(\theta) = \frac{n}{\theta}$. Hence the score test rejects whenever

$$\frac{\theta_0}{n}\left[\frac{n}{\theta_0}\left(\bar{X} - \theta_0\right)\right]^2 = \frac{n}{\theta_0}\left(\bar{X} - \theta_0\right)^2 > \chi_1^2(\alpha).$$

The Wald test rejects whenever

$$\frac{n}{\bar{X}}\left(\bar{X} - \theta_0\right)^2 > \chi_1^2(\alpha).$$

The likelihood ratio test rejects the null hypothesis whenever

$$-2\log\frac{L(\theta_0; \mathbf{X})}{L(\hat{\theta}; \mathbf{X})} > \chi_1^2(\alpha). \tag{2.8}$$
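The three statistics are simple to compute on data. The sketch below is our illustration (θ0 and the simulated data-generating value are arbitrary; numpy and scipy are assumed).

```python
# The three test statistics of Example 2.14 on simulated Poisson data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, theta0 = 100, 2.0
x = rng.poisson(2.4, size=n)              # data generated away from the null
xbar = x.mean()

score = n * (xbar - theta0) ** 2 / theta0
wald = n * (xbar - theta0) ** 2 / xbar
lrt = 2 * n * (xbar * np.log(xbar / theta0) - (xbar - theta0))

crit = stats.chi2(1).ppf(0.95)
print(score, wald, lrt, crit)             # the three agree to first order
```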
Example 2.15. Let {X1 , . . . , Xn } be a random sample from the normal distribution
with mean μ and variance σ 2 . Suppose we wish to test
H0 : μ = μ0
Notes
1. The score test geometrically represents the slope of the log-likelihood function
evaluated at θ̂ 0 . It is locally most powerful for small deviations from θ 0 . This can
be seen from a Taylor series expansion of the likelihood function L (θ 0 + h) around
θ 0 for small values of the vector h.
2. The Wald test, unlike the score test, is not invariant under re-parametrization (Stein, 1956). It is easy to construct confidence intervals using the Wald statistic. However, since the standard error is calculated under the alternative, it may lead to poor confidence intervals.
3. Both the likelihood ratio test and the Wald test are asymptotically equivalent to the score test under the null as well as under Pitman alternative hypotheses (Serfling (2009), p. 155). All the statistics have limiting distributions that are weighted chi-squared if the model is misspecified. See Lindsay and Qu (2003) for a more extensive discussion of the score test and related statistics.

4. An interesting application of the score test is given by Jarque and Bera (1987), who consider the Pearson family of distributions.
In the next section we study the concept of contiguity and its consequences.
2.3.2.2. Contiguity
In the previous section, we were concerned with determining the asymptotic distribution
of the test statistics under the null hypothesis. It is of interest for situations related to the
efficiency of tests, to determine the asymptotic distribution of the test statistic under
the alternative hypothesis as well. Since most of the tests considered in practice are
consistent, they tend to perform well for alternatives “far away” from the null hypothesis.
Consequently, the interest tends to focus on alternatives which are "close" to the null hypothesis, in a manner to be made more precise here. The concept of contiguity was introduced by Le Cam in connection with the study of local asymptotic normality. Specifically, the concept enables one to obtain the limiting distribution of a sequence of statistics under the alternative distribution from knowledge of the limiting distribution under the null hypothesis distribution. We follow closely the development and notation in Hájek and Sidak (1967) and van der Vaart (2007), and we begin with some definitions.
Definition 2.11. Suppose that P and Q are two probability measures defined on the same measurable space (Ω, A). We shall say that Q is absolutely continuous with respect to P, denoted Q ≪ P, if Q(A) = 0 whenever P(A) = 0 for all measurable sets A ∈ A.

If Q ≪ P, then we may compute the expectation of a function f(X) of a random vector X under Q from knowledge of the expectation under P of the product $f(X)\frac{dQ}{dP}$ through the change of measure formula

$$E_Q[f(X)] = E_P\left[f(X)\frac{dQ}{dP}\right].$$

The expression $\frac{dQ}{dP}$ is known as the Radon-Nikodym derivative. The asymptotic version of absolute continuity is the concept of contiguity.
Definition 2.12. Suppose that Pn and Qn are two sequences of probability measures defined on the same measurable spaces (Ωn, An). We shall say that Qn is contiguous with respect to Pn, denoted $Q_n \triangleleft P_n$, if Qn(An) → 0 whenever Pn(An) → 0 for all measurable sets An ∈ An. The sequences Pn, Qn are mutually contiguous if both $Q_n \triangleleft P_n$ and $P_n \triangleleft Q_n$.
Example 2.16. Suppose that under Pn we have a standard normal distribution whereas under Qn we have a normal distribution with mean μn → μ and variance σ² > 0. Then it follows that Pn and Qn are mutually contiguous. Note however that if μn → ∞, then for An = [μn, μn + 1], we have that Pn(An) → 0, yet Qn(An) converges to a positive constant.
Definition 2.13. Le Cam proposed three lemmas which enable us to verify contiguity and to obtain the limiting distribution under one measure given the limiting distribution under another. Suppose that Pn and Qn admit densities pn, qn respectively and define the likelihood ratio Ln for typical points x in the sample space by

$$L_n(x) = \begin{cases} \dfrac{q_n(x)}{p_n(x)} & p_n(x) > 0 \\ 1 & p_n(x) = q_n(x) = 0 \\ 0 & p_n(x) = 0 < q_n(x). \end{cases}$$
Lemma 2.7 (Le Cam's First Lemma). Let Fn be the cdf of Ln which converges weakly under Pn at continuity points to a distribution F for which

$$\int_0^\infty x\, dF(x) = 1.$$

Then Qn is contiguous with respect to Pn.

In particular, if under Pn, log Ln is asymptotically $N\left(-\frac{\sigma^2}{2}, \sigma^2\right)$, then contiguity follows from the fact that the moment generating function of a normal (μ, σ²) random variable evaluated at t = 1 must equal 1; that is, for large n,

$$E\left[e^{\log L_n}\right] \to e^{\mu + \frac{\sigma^2}{2}} = e^{-\frac{\sigma^2}{2} + \frac{\sigma^2}{2}} = 1.$$
Le Cam’s second lemma provides conditions under which a log likelihood ratio is
asymptotically normal under the hypothesis probability measure.
Let x = (x1 , . . . , xn ) and
"
n
pn (x) = fni (xi )
i=1
and
"
n
qn (x) = gni (xi )
i=1
Then
n
log Ln (X) = log [gni (Xi ) /fni (Xi )]
i=1
Let n (
1
)
Wn = 2 [gni (Xi ) /fni (Xi )] 2 − 1
i=1
Lemma 2.8 (Le Cam's Second Lemma). Suppose that the following uniform integrability condition holds:

$$\lim_{n \to \infty}\max_{1 \le i \le n} P_n\left(\left|g_{ni}(X_i)/f_{ni}(X_i) - 1\right| > \varepsilon\right) = 0, \qquad \text{for all } \varepsilon > 0,$$

and that Wn is asymptotically $N\left(-\frac{\sigma^2}{4}, \sigma^2\right)$ under Pn. Then

$$\lim_{n \to \infty} P_n\left(\left|\log L_n(\mathbf{X}) - W_n + \frac{\sigma^2}{4}\right| > \varepsilon\right) = 0, \qquad \varepsilon > 0,$$

and under Pn, log Ln(X) is asymptotically $N\left(-\frac{\sigma^2}{2}, \sigma^2\right)$.
The third lemma which is most often utilized establishes the asymptotic distribution
under the alternative hypothesis provided the measures are contiguous.
Lemma 2.9 (Le Cam's Third Lemma). Let Pn, Qn be sequences of probability measures on measurable spaces (Ωn, An) such that $Q_n \triangleleft P_n$. Let Yn be a sequence of k-dimensional random vectors such that, under Pn,

$$\left(Y_n, \log L_n\right) \xrightarrow{L} N_{k+1}\left(\begin{pmatrix} \mu \\ -\sigma^2/2 \end{pmatrix}, \begin{pmatrix} \Sigma & \tau \\ \tau^\top & \sigma^2 \end{pmatrix}\right).$$

Then, under Qn,

$$Y_n \xrightarrow{L} N_k\left(\mu + \tau, \Sigma\right).$$

To describe an important application, consider a parametric model with densities pθ and local alternatives $p_{\theta + h/\sqrt{n}}$. Under regularity conditions, the log likelihood ratio admits the expansion

$$\log\prod_{i}\frac{p_{\theta + h/\sqrt{n}}(X_i)}{p_\theta(X_i)} = \frac{1}{\sqrt{n}}\, h^\top\sum_{i}\dot{\ell}_\theta(X_i) - \frac{1}{2}\, h^\top I_\theta\, h + o_{P_\theta}(1), \tag{2.10}$$

where $\dot{\ell}_\theta(X_i)$ is the k-dimensional vector of partial derivatives of $\log p_\theta(X_i)$ with respect to θ. The expansion in (2.10) is known as the local asymptotic normality property (LAN). It follows from the central limit theorem that the right-hand side in (2.10) is asymptotically normal $N\left(-\frac{h^\top I_\theta h}{2}, h^\top I_\theta h\right)$ and consequently, by Le Cam's first lemma, the measures $p_\theta$, $p_{\theta + h/\sqrt{n}}$ on the left-hand side are contiguous. Next, suppose that we have a sequence of test statistics Tn which are, to first order, sums of i.i.d. random variables:

$$\sqrt{n}\left(T_n - \mu\right) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\psi(X_i) + o_{p_\theta}(1)$$

for some function ψ. Consequently, the joint distribution of $\frac{1}{\sqrt{n}}\sum_i \psi(X_i)$ and the log likelihood ratio is under pθ asymptotically multivariate normal:

$$\left(\sqrt{n}\left(T_n - \mu\right),\; \log\prod_{i}\frac{p_{\theta + h/\sqrt{n}}(X_i)}{p_\theta(X_i)}\right) \xrightarrow{L} N_{k+1}\left(\begin{pmatrix} 0 \\ -\frac{h^\top I_\theta h}{2} \end{pmatrix}, \begin{pmatrix} \Sigma & \tau \\ \tau^\top & h^\top I_\theta h \end{pmatrix}\right).$$

It then follows from Le Cam's third lemma that $\sqrt{n}(T_n - \mu)$ is asymptotically normal under $p_{\theta + h/\sqrt{n}}$.
We may demonstrate contiguity in the case of an exponential family of distributions.
Example 2.17. Let X = (X1, ..., Xn) be a random sample from the exponential family (2.4) with a real valued parameter θ. Then

$$\ell(\theta; \mathbf{X}) = \theta\sum_{i=1}^{n} t(X_i) - nK(\theta) + \text{const}, \qquad \ell'(\theta; \mathbf{X}) = \sum_{i=1}^{n} t(X_i) - nK'(\theta), \qquad \ell''(\theta; \mathbf{X}) = -nK''(\theta).$$

It follows that

$$\log L_n = \ell\left(\theta + h/\sqrt{n}; \mathbf{X}\right) - \ell(\theta; \mathbf{X}) = \frac{h}{\sqrt{n}}\sum_{i=1}^{n}\left[t(X_i) - K'(\theta)\right] - \frac{h^2}{2}K''(\theta) + o(1),$$

which by the central limit theorem is asymptotically $N\left(-\frac{\sigma^2}{2}, \sigma^2\right)$ under pθ with $\sigma^2 = h^2 K''(\theta)$, so that contiguity follows from Le Cam's first lemma.
The concept of contiguity will enable us to obtain the non-null asymptotic distribu-
tion of various linear rank statistics to be defined in later chapters.
"
K
LC = (Lk (θ; x))wk
k=1
where the {wk} are nonnegative weights selected to improve efficiency. The composite log-likelihood is

$$c\ell(\theta; \mathbf{x}) = \sum_{k=1}^{K} w_k\, \ell_k(\theta; \mathbf{x})$$

with

$$\ell_k(\theta; \mathbf{x}) = \log L_k(\theta; \mathbf{x}).$$

In the simplest case of independence, we have a genuine likelihood function

$$L_{ind}(\theta; \mathbf{x}) = \prod_{r=1}^{m} f_X(x_r; \theta).$$

Another example is the composite likelihood formed from the densities of the pairwise differences,

$$L_{diff}(\theta; \mathbf{x}) = \prod_{r=1}^{m-1}\prod_{s=r+1}^{m} f_X(x_r - x_s; \theta).$$
Define the composite score function and the associated information quantities

$$U(\theta; \mathbf{x}) = \frac{\partial}{\partial \theta}\, c\ell(\theta; \mathbf{x}) \tag{2.11}$$

$$H(\theta) = E_\theta\left[-\frac{\partial}{\partial \theta}\, U(\theta; \mathbf{X})\right] \tag{2.12}$$

$$J(\theta) = var_\theta\left\{U(\theta; \mathbf{X})\right\} \tag{2.13}$$

$$G(\theta) = H(\theta)\, J(\theta)^{-1} H(\theta) \tag{2.14}$$

$$I(\theta) = var_\theta\left\{\frac{\partial}{\partial \theta}\log f(X; \theta)\right\} \tag{2.15}$$
Under suitable regularity conditions, the maximum composite likelihood estimator is asymptotically distributed as Np(μ, G⁻¹(θ)), where Np(μ, G⁻¹(θ)) is a p-dimensional normal distribution with mean vector μ and variance-covariance matrix G⁻¹(θ).
We end this chapter with a brief discussion of Bayesian methods. Suppose that θ is itself a random variable with prior density p(θ; α) indexed by a known hyperparameter α, and that the data have conditional density f(x|θ). The conditional density of θ given X = x, labeled the posterior density of θ, is given by Bayes' theorem:

$$p(\theta|\mathbf{x}) = \frac{f(\mathbf{x}|\theta)\, p(\theta; \alpha)}{\int f(\mathbf{x}|\theta)\, p(\theta; \alpha)\, d\theta}.$$
In view of the factorization theorem in Section 2.3.1, the posterior density is always a
function of the sufficient statistic. The use of Bayesian methods enables us to update
information about the prior. There have been countless applications of Bayesian infer-
ence in practice. We refer the reader to Box and Tiao (1973) for further details. We
provide below some simple examples, whereas in Part III of this book, we consider a
modern application of Bayesian methods to ranking data.
Example 2.18. Let x = {x1, ..., xn} be a set of observed data randomly drawn from a normal distribution with mean θ and variance σ². Assume that θ is itself a random variable having a normal prior distribution with known mean μ and known variance τ². Then the posterior distribution of θ given x is proportional to

$$p(\theta|\mathbf{x}) \propto \prod_{i=1}^{n} f(x_i|\theta)\, p(\theta; \alpha) \propto \exp\left(-\frac{n\sigma^{-2}}{2}\left(\bar{x}_n - \theta\right)^2\right)\exp\left(-\frac{\tau^{-2}}{2}\left(\theta - \mu\right)^2\right) \propto \exp\left(-\frac{\tau_n^{-2}}{2}\left(\theta - \mu_n\right)^2\right),$$

where

$$\mu_n = \frac{n\sigma^{-2}\bar{x}_n + \tau^{-2}\mu}{n\sigma^{-2} + \tau^{-2}}, \qquad \tau_n^{-2} = n\sigma^{-2} + \tau^{-2}.$$

We recognize therefore this posterior to be a normal density with mean μn and variance τn².
Example 2.19. Let {x1, ..., xn} be a random sample from the Bernoulli distribution

$$f(x|\theta) = \theta^x(1 - \theta)^{1-x}, \qquad x = 0, 1.$$

Suppose that the prior for θ is the Beta distribution with parameters α > 0, β > 0:

$$p(\theta; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, \theta^{\alpha - 1}(1 - \theta)^{\beta - 1}, \qquad 0 < \theta < 1.$$

Then the posterior distribution of θ given x is again a Beta distribution but with parameters

$$\alpha_n = \alpha + \sum x_i, \qquad \beta_n = \beta + n - \sum x_i.$$
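The conjugate update can be checked by a crude simulation (our illustration; the prior parameters and data are arbitrary, and numpy/scipy are assumed): among prior draws of θ, those that reproduce the observed number of successes follow the stated Beta posterior, since $\sum x_i$ is sufficient.

```python
# Simulation check of the Beta-Bernoulli conjugate update: filter prior draws
# on the observed success count s and compare with Beta(alpha_n, beta_n).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
alpha, beta, n, s = 2.0, 3.0, 10, 7            # s = observed sum of the x_i

theta = rng.beta(alpha, beta, size=400_000)    # prior draws
keep = rng.binomial(n, theta) == s             # keep draws consistent with data
a_n, b_n = alpha + s, beta + n - s

print(theta[keep].mean(), a_n / (a_n + b_n))                 # posterior means agree
print(np.quantile(theta[keep], 0.9), stats.beta(a_n, b_n).ppf(0.9))
```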
In any given problem involving Bayesian methods, the specification of the prior and the consequent computation of the posterior may pose some difficulties. In certain problems, this difficulty is overcome if the prior and the posterior come from the same family of distributions. The prior in that case is called a conjugate prior. In the previous example, both the prior and the posterior distributions were normal distributions but with different parameters. In many modern problems not involving conjugate priors, the Bayesian computation of the posterior distribution is a challenging task. The goal in many instances is to compute the expectation of some function h(θ) with respect to the posterior distribution:

$$E[h(\theta)|\mathbf{x}] = \int h(\theta)\, p(\theta|\mathbf{x})\, d\theta = \frac{\int h(\theta)\, f(\mathbf{x}|\theta)\, p(\theta; \alpha)\, d\theta}{\int f(\mathbf{x}|\theta)\, p(\theta; \alpha)\, d\theta}.$$

We encounter such integrals for example when one is interested in the posterior probability that h(θ) lies in an interval. If one can draw a random sample θ⁽¹⁾, θ⁽²⁾, ..., θ⁽ᵐ⁾ from p(θ|x), the strong law of large numbers guarantees that E[h(θ)|x] is well approximated by the sample mean $\frac{1}{m}\sum_{i=1}^{m} h\left(\theta^{(i)}\right)$ if m is large enough. What happens if p(θ|x) is hard to sample? One way of proceeding is to approximate the posterior distribution by a multivariate normal density centered at the mode of θ obtained through
an optimization method (see Albert (2008) p. 94). Another approach is to generate sam-
ples from the posterior distribution of θ indirectly via various simulation methods such
as importance sampling, rejection sampling, and Markov Chain Monte Carlo (MCMC)
methods. We describe these methods below.
1. Importance Sampling.

When it is hard to sample from p(θ|x) directly, one can resort to importance sampling. Suppose we are able to sample from another distribution q(θ). Then

$$E_p[h(\theta)] = \int h(\theta)\, p(\theta|\mathbf{x})\, d\theta = \int h(\theta)\, \frac{p(\theta|\mathbf{x})}{q(\theta)}\, q(\theta)\, d\theta = E_q[h(\theta)\, w(\theta)],$$

with weight w(θ) = p(θ|x)/q(θ). One can draw a random sample θ⁽¹⁾, θ⁽²⁾, ..., θ⁽ᵐ⁾ from q(θ) and E[h(θ)|x] can be approximated by $\widehat{h(\theta)} = \frac{1}{m}\sum_{i=1}^{m} h\left(\theta^{(i)}\right) w\left(\theta^{(i)}\right)$. In practice, q(θ) should be chosen so that it is easy to sample from and achieves a small estimation error. To control the Monte Carlo error, one can choose q(θ) to minimize the variance of $\widehat{h(\theta)}$; see Robert and Casella (2004).
2. Rejection Sampling.

Rejection sampling consists of identifying a proposal density, say q(θ), which "resembles" the posterior density with respect to location and spread and which is easy to sample from. As well, we ask that for all θ and a constant c,

$$p(\theta|\mathbf{x}) \le c\, q(\theta).$$

One repeatedly draws a candidate θ from q(θ) together with an independent uniform variate U on (0, 1), and accepts the candidate whenever

$$U \le \frac{p(\theta|\mathbf{x})}{c\, q(\theta)}.$$
We may justify the procedure as follows. Suppose in general that Y has density fY(y) and V has density fV(v) with common support, and set

$$M = \sup_y \frac{f_Y(y)}{f_V(y)}.$$

It is easy to see that the value of Y generated from this rejection algorithm is distributed as fY:

$$P(Y \le y) = P\left(V \le y \,\Bigg|\, U < \frac{1}{M}\frac{f_Y(V)}{f_V(V)}\right) = \frac{P\left(V \le y,\; U < \frac{1}{M}\frac{f_Y(V)}{f_V(V)}\right)}{P\left(U < \frac{1}{M}\frac{f_Y(V)}{f_V(V)}\right)} = \frac{\int_{-\infty}^{y}\int_{0}^{\frac{1}{M}\frac{f_Y(v)}{f_V(v)}} f_V(v)\, du\, dv}{\int_{-\infty}^{\infty}\int_{0}^{\frac{1}{M}\frac{f_Y(v)}{f_V(v)}} f_V(v)\, du\, dv} = \int_{-\infty}^{y} f_Y(v)\, dv.$$
3. Markov Chain Monte Carlo (MCMC) Methods.

- The Gibbs Sampling algorithm was developed by Geman and Geman (1984). See Casella and George (1992) for an intuitive exposition of the algorithm. This algorithm helps generate random variables indirectly from the posterior distribution without relying on its density. The following describes the algorithm.

Suppose we have to estimate m parameters θ1, θ2, ..., θm−1 and θm. Let p(θi | x, θ1, θ2, ..., θi−1, θi+1, ..., θm) be the full conditional distribution of θi and $\theta_i^{(t)}$ be the random variate of θi simulated in the tth iteration. Then the procedure of Gibbs Sampling is:

(a) Set initial values for the m parameters $\{\theta_1^{(0)}, \theta_2^{(0)}, \ldots, \theta_m^{(0)}\}$.

(b) Repeat the steps below. At the tth iteration,

(1) draw $\theta_1^{(t)}$ from $p(\theta_1 \mid \mathbf{x}, \theta_2^{(t-1)}, \theta_3^{(t-1)}, \theta_4^{(t-1)}, \ldots, \theta_m^{(t-1)})$
(2) draw $\theta_2^{(t)}$ from $p(\theta_2 \mid \mathbf{x}, \theta_1^{(t)}, \theta_3^{(t-1)}, \theta_4^{(t-1)}, \ldots, \theta_m^{(t-1)})$
(3) draw $\theta_3^{(t)}$ from $p(\theta_3 \mid \mathbf{x}, \theta_1^{(t)}, \theta_2^{(t)}, \theta_4^{(t-1)}, \ldots, \theta_m^{(t-1)})$
...
(m) draw $\theta_m^{(t)}$ from $p(\theta_m \mid \mathbf{x}, \theta_1^{(t)}, \theta_2^{(t)}, \theta_3^{(t)}, \ldots, \theta_{m-1}^{(t)})$

Tierney (1994) showed that if this process is repeated many times, the random variates drawn will converge to a draw from the joint posterior distribution of θ1, θ2, ..., θm−1 and θm given x. Suppose that the process is repeated M + N times. In practice, one can discard the first M iterates as the burn-in period. In order to reduce the autocorrelations of the iterates, one in every a observations is kept in the last N iterates. The choice of M can be determined by examining the trace plots of the Gibbs iterates for stationarity. The value of a can be selected by examining the autocorrelation plots of the Gibbs iterates.
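A compact sketch follows (our illustration; the bivariate normal target with correlation ρ, the burn-in M and the thinning interval a are arbitrary choices, and numpy is assumed). The full conditionals of a standard bivariate normal are $\theta_1 \mid \theta_2 \sim N(\rho\theta_2, 1 - \rho^2)$ and symmetrically for θ2.

```python
# Gibbs sampling for a standard bivariate normal with correlation rho,
# with burn-in M and thinning a as described above.
import numpy as np

rng = np.random.default_rng(10)
rho, M, N, a = 0.8, 1_000, 20_000, 5
t1, t2 = 0.0, 0.0
draws = []
for t in range(M + N):
    t1 = rng.normal(rho * t2, np.sqrt(1 - rho ** 2))   # step (1)
    t2 = rng.normal(rho * t1, np.sqrt(1 - rho ** 2))   # step (2)
    if t >= M and (t - M) % a == 0:                    # burn-in and thinning
        draws.append((t1, t2))

draws = np.array(draws)
print(draws.mean(axis=0))                # approximately (0, 0)
print(np.corrcoef(draws.T)[0, 1])        # approximately rho = 0.8
```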
A different approach is variational inference, in which the posterior p(θ|x) is approximated by a more tractable distribution q(θ) chosen to minimize the Kullback-Leibler (KL) divergence between q(θ) and p(θ|x). Under the mean-field approximation, q(θ) is taken to factor into independent components, $q(\theta) = \prod_i q_i(\theta_i)$. To retain the possible dependency among the θi's, one may instead adopt a structural factorization of q(θ) as
q(θ) = q1 (θ1 )q2 (θ2 |θ1 )q3 (θ3 |θ2 , θ1 ).
Under either the mean-field approximation or the structural factorization, we can convert
the problem of minimizing the KL divergence with respect to the joint distribution q(θ)
to that of minimizing KL divergences with respect to individual univariate distribution
qi ’s. For an application to an angle-based ranking model, see Chapter 11.
2.4. Exercises
Exercise 2.1. Suppose that the conditional density of Y given μ is normal with mean μ and variance σ². Suppose that the distribution of μ is normal with mean m and variance τ². Show that $\frac{Y - E[Y]}{\sqrt{Var(Y)}}$ is also normally distributed.
Exercise 2.2. Let X(1) ≤ X(2) ≤ ... ≤ X(n) denote the order statistics of a random sample of size n from a continuous distribution F.

(b) Find the distribution of the range X(n) − X(1) when F is given by the uniform on the interval (0, 1).

(c) Find the distribution of the sample median when n is an odd integer.
Exercise 2.3. Suppose that in Exercise 2.2, F (x) is the exponential cumulative distri-
bution with mean 1.
(a) Show that the pairwise differences X(i) − X(i−1) for i = 2, . . . , n are independent.
Exercise 2.4. Consider the linear model

$$Y_i = \mathbf{x}_i^\top\boldsymbol{\beta} + e_i, \qquad i = 1, \ldots, n,$$

where β and xi are p × 1 vectors and {ei} are i.i.d. N(0, σ²) random error terms. Suppose we wish to test

$$H_0: A\boldsymbol{\beta} = 0 \qquad \text{against} \qquad H_1: A\boldsymbol{\beta} \ne 0,$$

where A is a k × p known matrix. Derive the likelihood ratio test.
Exercise 2.6. Suppose that we have a random sample of size n from a Bernoulli distri-
bution with mean θ. Suppose that we impose a Beta(α, β) prior distribution on θ. Find
the posterior distribution of θ.
Exercise 2.7. Suppose that we have a random sample of size n from some distribution
having density f (x|θ) conditional on θ. Suppose that there is a prior density g (θ) on θ.
We would like to estimate θ using square error loss, L (θ, a) = (θ − a)2 .
(a) Show that the mean of the posterior distribution minimizes the Bayes risk
EL (θ, a) ;
Exercise 2.8. Suppose that X has a uniform density on the interval $\left(\theta - \frac{1}{2}, \theta + \frac{1}{2}\right)$.

(a) Show that for a random sample of size n, the estimator $\frac{X_{(n)} + X_{(1)}}{2}$ has mean square error $\frac{1}{2(n+2)(n+1)}$, where X(n), X(1) are respectively the maximum and minimum of the sample.

(b) Show that the estimator $X_{(n)} - \frac{1}{2}$ is consistent and has mean square error $\frac{2}{(n+2)(n+1)}$.
Exercise 2.9. Suppose that under Pn we have a uniform distribution on the interval [0, 1]
whereas under Qn the distribution is uniform on the interval [an , bn ] with an → 0, bn → 1.
Show that Pn , Qn are mutually contiguous.
3. Tools for Nonparametric Statistics
Nonparametric statistics is concerned with the development of distribution free methods
to solve various statistical problems. Examples include tests for a monotonic trend, or
tests of hypotheses that two samples come from the same distribution. One of the
important tools in nonparametric statistics is the use of ranking data. When the data
is transformed into ranks, one gains the simplicity that objects may be more easily
compared. For example, if web pages are ranked in accordance to some criterion, one
obtains a quick summary of choices. In this chapter, we will study linear rank statistics
which are functions of the ranks. These consist of two components: regression coefficients
and a score function. The latter allows one to choose the function of the ranks in order
to obtain an “optimal score” while the former is tailor made to the problem at hand.
Two central limit theorems, one due to Hajek and Sidak and the other to Hoeffding, play
important roles in deriving asymptotic results for statistics which are linear functions of
the ranks.
The subject of U statistics has played an important role in studying global properties
of the underlying distributions. Many well-known test statistics including the mean and
variance of a random sample are U statistics. Here as well, Hoeffding’s central limit
theorem for U statistics plays a dominant role in deriving asymptotic distributions. In
the next sections we describe in more detail some of these key results.
3.1. Linear Rank Statistics

For a sample X1, ..., Xn from a continuous distribution (so that ties occur with probability zero), the rank of Xi may be written as $R_i = 1 + \sum_{j \ne i} I[X_i - X_j]$, where

$$I[u] = \begin{cases} 1 & u > 0, \\ 0 & u < 0. \end{cases}$$
We begin with the definition of a linear rank statistic.
A statistic of the form

$$S_n = \sum_{i=1}^{n} a(i, R_i)$$

is called a linear rank statistic, where the {a(i, j)} represent a matrix of values. The statistic is called a simple linear rank statistic if

$$a(i, R_i) = c_i\, a(R_i),$$

where {c1, ..., cn} and {a(1), ..., a(n)} are given vectors of real numbers. The {ci} are called regression coefficients whereas the {a(i)} are labeled scores.
To illustrate why the {ci} are called regression coefficients, suppose that X1, ..., Xn are a set of independent random variables such that

$$X_i = \alpha + \Delta c_i + \varepsilon_i,$$

where {εi} are i.i.d. continuous random variables with cdf F having median 0. We are interested in testing hypotheses about the slope Δ. Denoting the ranks of the differences {Xi − Δci} by R1, ..., Rn and assuming that c1 ≤ c2 ≤ ... ≤ cn, we see that the usual sample correlation coefficient for the pairs (ci, a(Ri)) is

$$\frac{\sum_i (c_i - \bar{c})\left(a(R_i) - \bar{a}\right)}{\sqrt{\sum_i (c_i - \bar{c})^2\, \sum_i \left(a(R_i) - \bar{a}\right)^2}},$$

where a(·) is an increasing function and c̄ and ā are respectively the means of the {ci} and {a(Ri)}. Some simplification shows this is a simple linear rank statistic.
Many test statistics can be expressed as simple linear rank statistics. For example, suppose that we have a random sample of n observations with corresponding ranks {Ri} and we would like to test for a linear trend. In that case, we may use the simple linear rank statistic

$$S_n = \sum_{i=1}^{n} i R_i,$$

which correlates the ordered time scale with the observed ranks of the data.
Table 3.2.: Some common score functions a(i)

- Mood: $a(i) = \left(i - \frac{n+1}{2}\right)^2$
- Freund-Ansari-Bradley: $a(i) = \left|i - \frac{n+1}{2}\right|$
- Siegel-Tukey: $a(i) = 2i$ for i even, $1 < i \le \frac{n}{2}$; $2i - 1$ for i odd, $1 \le i < \frac{n}{2}$; $2(n - i) + 2$ for i even, $\frac{n}{2} < i \le n$; $2(n - i) + 1$ for i odd, $\frac{n}{2} < i < n$
- Klotz: $a(i) = \left[\Phi^{-1}\left(\frac{i}{n+1}\right)\right]^2$
- Normal scores: $a(i) = E\left[V_{(i)}\right]$
In another example, suppose we have a random sample of size n from one population (X) and N − n from another (Y). We are interested in testing the null hypothesis that the two populations are the same against the alternative that they differ in location. We may rank all the N observations together and retain only the ranks of the second population (Y), denoted Ri, i = n + 1, ..., N, by choosing

$$c_i = \begin{cases} 0 & i = 1, \ldots, n \\ 1 & i = n+1, \ldots, N. \end{cases}$$

The test statistic takes the form $T_n = \sum_{i=n+1}^{N} c_i R_i$. Properties of Tn under the null and alternative hypotheses are given in Gibbons and Chakraborti (2011). We may generalize this test statistic by defining functions {a(Ri), i = 1, ..., N} of the ranks and choosing

$$T_n = \sum_{i=n+1}^{N} c_i\, a(R_i).$$
These functions may be chosen to reflect emphasis either on location or on scale. For
both the problems of location and scale, Tables 3.1 and 3.2 list values of the constants
for location and scale alternatives. Here, V(i) represents the ith order statistic from
a standard normal with cumulative distribution function Φ (x). We shall see in later
chapters that this results in the well-known two-sample Wilcoxon statistic.
Lemma 3.1. Suppose that the set of rankings (R1, ..., Rn) is uniformly distributed over the set of n! permutations of the integers 1, 2, ..., n. Let $S_n = \sum_{i=1}^{n} c_i\, a(R_i)$. Then,

(a) for i = 1, ..., n,

$$E[R_i] = \frac{n+1}{2}, \qquad Var(R_i) = \frac{n^2 - 1}{12},$$

and for i ≠ j,

$$Cov(R_i, R_j) = -\frac{n+1}{12}.$$

(b) $E[S_n] = n\bar{c}\bar{a}$, where $\bar{c} = \sum_{i=1}^{n} c_i / n$ and $\bar{a} = \sum_{i=1}^{n} a(i)/n$.

(c) $Var(S_n) = \frac{1}{n-1}\sum_{i=1}^{n}(c_i - \bar{c})^2\, \sum_{i=1}^{n}\left(a(i) - \bar{a}\right)^2$.

(d) Let $T_n = \sum_{i=1}^{n} d_i\, b(R_i)$, for regression coefficients {di} and score function b(·). Then

$$Cov(S_n, T_n) = \sigma_{ab}\sum_{i=1}^{n}(c_i - \bar{c})\left(d_i - \bar{d}\right),$$

where $\bar{d} = \sum_{i=1}^{n} d_i / n$ and $\sigma_{ab} = \frac{1}{n-1}\sum_{i=1}^{n}\left(a(i) - \bar{a}\right)\left(b(i) - \bar{b}\right)$.

Proof. The proof of this lemma is straightforward and is left as an exercise for the reader.
Example 3.1. The simple linear rank statistic S_n = Σ_{i=1}^{n} i R_i can be used for testing
the hypothesis that two continuous random variables are independent. Suppose that we
observe a random sample of pairs (X_i, Y_i), i = 1, ..., n. Replace the X's by their ranks,
say R_1, ..., R_n, and the Y's by their ranks, say Q_1, ..., Q_n. A nonparametric measure of
the correlation between the variables is the Spearman correlation

    ρ = Σ_i (R_i − R̄)(Q_i − Q̄) / sqrt( Σ_i (R_i − R̄)² Σ_i (Q_i − Q̄)² )
      = Σ_i (i − (n+1)/2)(R_i − (n+1)/2) / Σ_i (i − (n+1)/2)²,

where, after relabeling the pairs so that the X-ranks appear in their natural order, R_i
denotes the rank of the Y paired with the ith smallest X. Under the hypothesis of
independence,

    E[ρ] = 0,   Var(ρ) = 1/(n − 1).
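These null moments are easily verified by simulation. The following R sketch (the
choices of n and the number of replications are arbitrary) draws independent pairs of
random rankings and compares the empirical moments with E[ρ] = 0 and Var(ρ) = 1/(n − 1):

    set.seed(1)
    n <- 10
    rho <- replicate(5000, {
      R <- sample(n)   # ranks of the X's: a uniformly random permutation
      Q <- sample(n)   # ranks of the Y's, independent of R
      cor(R, Q)        # Pearson correlation of the ranks = Spearman's rho
    })
    mean(rho)          # close to 0
    var(rho)           # close to 1/(n - 1) = 0.111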
Under certain conditions, linear rank statistics are asymptotically normally distributed.
We shall consider square integrable functions φ defined on the interval (0, 1) which
have the property that they can be written as the difference between two nondecreasing
functions and satisfy

    0 < ∫_0^1 (φ(u) − φ̄)² du < ∞,

where φ̄ = ∫_0^1 φ(u) du. An important result for proving limit theorems is the Projection
Theorem.
Theorem 3.1 (Projection Theorem). Let T(R_1, ..., R_n) be a rank statistic and set

    â(i, j) = E[T | R_i = j].

Then the projection of T into the family of linear rank statistics is given by

    T̂ = ((n − 1)/n) Σ_{i=1}^{n} â(i, R_i) − (n − 2) E[T].

Proof. The proof of this theorem is given in (Hájek and Sidak (1967), p. 59).
Example 3.2 (Hájek and Sidak (1967), p. 60). Suppose that R_1, ..., R_n are uniformly
distributed over the set of n! permutations of the integers 1, 2, ..., n. Consider the
projection of the Kendall statistic defined by

    τ = Σ_{i≠j} sgn(i − j) sgn(R_i − R_j) / (n(n − 1)).

Given R_k = j, the rank of any other observation is uniformly distributed over the
remaining ranks, so that E[sgn(j − R_h) | R_k = j] = (2j − n − 1)/(n − 1) for h ≠ k, and hence

    E[τ | R_k = j] = (1/(n(n−1))) [ Σ_h sgn(k − h) (2j − n − 1)/(n − 1)
                                    + Σ_i sgn(i − k) (n + 1 − 2j)/(n − 1) ].

Applying Theorem 3.1, the projection is

    τ̂ = (8/(n²(n − 1))) Σ_i (i − (n+1)/2)(R_i − (n+1)/2).
Theorem 3.2. Suppose that R_1, ..., R_n are uniformly distributed over the set of n!
permutations of the integers 1, 2, ..., n. Let the score function be given by any one of the
following three functions:

    a(i) = φ(i/(n + 1)),
    a(i) = n ∫_{(i−1)/n}^{i/n} φ(u) du,
    a(i) = E[φ(U_n^{(i)})],

where φ is a square integrable function on (0, 1) and U_n^{(1)} < ... < U_n^{(n)} are the order
statistics from a sample of n uniformly distributed random variables. Let

    S_n = Σ_{i=1}^{n} c_i a(R_i)

and

    σ_n² = (1/(n − 1)) Σ_{i=1}^{n} (c_i − c̄)² Σ_{i=1}^{n} (a_i − ā)²
         ≈ Σ_{i=1}^{n} (c_i − c̄)² ∫_0^1 (φ(u) − φ̄)² du.

Then, provided the regression coefficients satisfy Noether's condition

    max_i (c_i − c̄)² / Σ_{i=1}^{n} (c_i − c̄)² → 0 as n → ∞,

(S_n − E[S_n])/σ_n converges in law to a standard normal distribution.
Proof. The details of the proof of this theorem, given in (Hájek and Sidak (1967), p. 160),
consist in showing that the R_i/(n + 1) behave asymptotically like a random sample of uni-
formly distributed random variables and then that S_n is equivalent to a sum of indepen-
dent random variables to which we can apply the Lindeberg-Feller central limit theorem
(Theorem 2.4).
The asymptotic distribution of various linear rank statistics under the alternative
was obtained using the concept of contiguity (Hájek and Sidak (1967)), which we further
describe in Chapter 9 when we discuss efficiency. See also Hajek (1968) for a more
general result.
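As an illustration of Theorem 3.2, the following R sketch simulates the simple linear
rank statistic S_n = Σ c_i a(R_i) with c_i = a(i) = i under random permutations and
standardizes it by the moments of Lemma 3.1; the sample size and number of replications
are arbitrary choices:

    set.seed(2)
    n <- 50
    c_i <- 1:n; a_i <- 1:n                       # regression constants and scores
    S <- replicate(5000, sum(c_i * sample(n)))   # S_n = sum_i i * R_i
    mu <- n * mean(c_i) * mean(a_i)              # Lemma 3.1(b)
    v  <- sum((c_i - mean(c_i))^2) * sum((a_i - mean(a_i))^2) / (n - 1)  # Lemma 3.1(c)
    z <- (S - mu) / sqrt(v)
    c(mean(z), var(z))   # approximately (0, 1)
    qqnorm(z)            # points fall close to the 45-degree line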
3.2. U Statistics
The theory of U statistics was initiated and developed by Hoeffding (1948) who used
it to study global properties of a distribution function. Let {X_1, ..., X_n} be a random
sample from some distribution F. Let h(x_1, ..., x_m) be a real valued measurable function,
symmetric in its arguments, such that

    E[h(X_1, ..., X_m)] = θ_F.

The corresponding U statistic is defined as

    U_n = \binom{n}{m}^{-1} Σ h(X_{i_1}, ..., X_{i_m}),

where the summation is taken over all combinations of m integers (i_1 < i_2 < ... < i_m)
chosen from (1, 2, ..., n).
An important property of a U statistic is that it is a minimum variance unbiased
estimator of θF . There are numerous examples of U statistics which include the moments
of a distribution, the variance, and the serial correlation coefficient. We present some
below.
Example 3.3. (i) The moments of a distribution are given by the choice h(x) = x^r.

(ii) The sample variance is obtained from the choice of h(x_i, x_j) = (x_i − x_j)²/2, from which
we get

    S_n² = \binom{n}{2}^{-1} Σ_{i<j} (x_i − x_j)²/2,   n > 1.
(iii) Let (x_i, y_i) be a sequence of n pairs of real numbers and construct for each coordi-
nate the n(n − 1) signs of differences {sgn(x_i − x_j), i ≠ j} and {sgn(y_i − y_j), i ≠ j}.
Define the kernel

    h((x_i, y_i), (x_j, y_j)) = sgn(x_i − x_j) sgn(y_i − y_j).

The resulting U statistic, with the sum extending over all possible pairs of indices, is the
covariance between the signs of the differences between the two sets. This is the Kendall
statistic often used for measuring correlation.

(iv) Gini's mean difference statistic for a set of n real numbers {x_i} is given by

    (1/(n(n − 1))) Σ_{i≠j} |x_i − x_j|.

(v) The serial correlation coefficient for a set of n real numbers {x_i} given by

    Σ_{i=1}^{n−1} x_i x_{i+1}

is a U statistic.
We may obtain a general expression for the variance of a U statistic. Denote,
for c = 0, 1, ..., m, the conditional expectation of h(X_1, ..., X_m) given (X_1 = x_1, ..., X_c = x_c) by

    h_c(x_1, ..., x_c) = E[h(x_1, ..., x_c, X_{c+1}, ..., X_m)].

Theorem 3.3. The variance of U_n is given by

    Var(U_n) = \binom{n}{m}^{-1} Σ_{c=1}^{m} \binom{m}{c} \binom{n−m}{m−c} σ_c²,

where σ_c² = Var[h_c(X_1, ..., X_c)]. Moreover, the variances are nondecreasing:

    σ_1² ≤ σ_2² ≤ ... ≤ σ_m²,
and for large n, if σ_m² < ∞,

    Var(U_n) ≅ m² σ_1²/n.
Proof. See Ferguson (1996).
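The large-n approximation Var(U_n) ≅ m² σ_1²/n may be checked by simulation. The R
sketch below does so for Gini's mean difference (m = 2) with standard normal data;
σ_1² is itself estimated by Monte Carlo, so both displayed quantities carry simulation
error:

    set.seed(3)
    gini <- function(x) {
      d <- abs(outer(x, x, "-"))
      mean(d[lower.tri(d)])          # average over the n(n-1)/2 pairs
    }
    n <- 30
    U <- replicate(4000, gini(rnorm(n)))
    var(U)                           # simulated Var(U_n)
    # Approximation with m = 2: Var(U_n) ~ 4 * sigma_1^2 / n, where
    # sigma_1^2 = Var(h_1(X)) and h_1(x) = E|x - X|, estimated by Monte Carlo.
    h1 <- sapply(rnorm(2000), function(a) mean(abs(a - rnorm(1e4))))
    4 * var(h1) / n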
Example 3.4. Let X_1, ..., X_n be a random sample for which E[X] = 0. Let the kernel
function be defined as the product

    h(x_1, x_2) = x_1 x_2.

Then,

    E[U_n] = 0

and

    h_1(x) = E[h(X_1, X_2) | X_1 = x] = 0.

This implies σ_1² = 0 and hence we have a degeneracy of order 1.
An important property of U statistics is that under certain conditions, they have
limiting distributions as the next theorem states.
Theorem 3.4. Let σ_m² < ∞ and suppose the kernel is degenerate, with σ_1² = 0. Then

    n(U_n − θ_F) →_L Σ_{j=1}^{∞} λ_j (Z_j² − 1),

where the {Z_j} are i.i.d. N(0, 1) and the {λ_j} are the eigenvalues satisfying

    h(x_1, x_2) − θ_F = Σ_j λ_j φ_j(x_1) φ_j(x_2).
Definition 3.4. Consider two independent random samples X_1, ..., X_{n_1} from F and
Y_1, ..., Y_{n_2} from G. Let h(x_1, ..., x_{m_1}; y_1, ..., y_{m_2}) be a kernel function, symmetric in
the x's and separately symmetric in the y's, with finite expectation θ. The two-sample U
statistic is defined as

    U_{n_1,n_2} = \binom{n_1}{m_1}^{-1} \binom{n_2}{m_2}^{-1} Σ h(X_{i_1}, ..., X_{i_{m_1}}; Y_{j_1}, ..., Y_{j_{m_2}}),

where the sum is taken over all subscripts 1 ≤ i_1 < ... < i_{m_1} ≤ n_1 chosen from
(1, 2, ..., n_1) and all subscripts 1 ≤ j_1 < ... < j_{m_2} ≤ n_2 chosen from (1, 2, ..., n_2).
In analogy with the one-sample case, define for i, j the conditional expectation of
h(X_1, ..., X_{m_1}; Y_1, ..., Y_{m_2}) given (X_1 = x_1, ..., X_i = x_i) and (Y_1 = y_1, ..., Y_j = y_j)
by h_{ij}(x_1, ..., x_i; y_1, ..., y_j), with σ_{ij}² = Var[h_{ij}(X_1, ..., X_i; Y_1, ..., Y_j)].
Theorem 3.5. The variance of U_{n_1,n_2} is given by

    Var(U_{n_1,n_2}) = \binom{n_1}{m_1}^{-1} \binom{n_2}{m_2}^{-1}
        Σ_{i=0}^{m_1} Σ_{j=0}^{m_2} \binom{m_1}{i} \binom{n_1−m_1}{m_1−i} \binom{m_2}{j} \binom{n_2−m_2}{m_2−j} σ_{ij}²,

where σ_{00}² = 0. Moreover, as n_1 + n_2 → ∞ with n_1/(n_1 + n_2) → λ, 0 < λ < 1,

    sqrt(n_1 + n_2) (U_{n_1,n_2} − θ) →_L N(0, σ²),

where

    σ² = (m_1²/λ) σ_10² + (m_2²/(1 − λ)) σ_01².
Proof. See (Bhattacharya et al. (2016), or Lehmann (1975), p. 364) for a slight variation
of this result. The proof as in the one-sample case is based on a projection argument.
Example 3.5. Let X_1, ..., X_{n_1} and Y_1, ..., Y_{n_2} be two independent random samples
from distributions F(x) and G(y) respectively. Let h(X, Y) be a two-sample kernel
function and let the corresponding U statistic be given by

    U_{n_1,n_2} = (1/(n_1 n_2)) Σ_{i=1}^{n_1} Σ_{j=1}^{n_2} h(X_i, Y_j).

Set m_1 = m_2 = 1 in Theorem 3.5 and let h_10(x) = E[h(x, Y)] and h_01(y) = E[h(X, y)].
Then, as n_1 + n_2 → ∞ with n_1/(n_1 + n_2) → λ,

    sqrt(n_1 + n_2) (U_{n_1,n_2} − θ) →_L N(0, σ²),

where

    σ² = σ_10²/λ + σ_01²/(1 − λ)

and

    σ_10² = Var[h_10(X)],   σ_01² = Var[h_01(Y)].
A composite likelihood may also be associated with a U statistic by exponentially tilting
the joint density of the observations entering the kernel, the product being taken over
all possible subsets of r indices chosen from (1, ..., n). It is then seen that the composite
log likelihood satisfies

    ℓ(θ) ∼ θ U_n(x_1, ..., x_n) − K(θ).                          (3.2)

The log likelihood obtained in (3.2) will be shown to lead to well-known test statistics.

Remark. A modern detailed account of multivariate U statistics may be found in Lee
(1990) and Gotze (1987).
3.3. Hoeffding's Combinatorial Central Limit Theorem

Hoeffding also obtained a combinatorial central limit theorem for statistics of the form

    S_n = Σ_{i=1}^{n} a_n(i, R_{ni}),

where {a_n(i, j)} is an n × n array of real numbers and (R_{n1}, ..., R_{nn}) is uniformly
distributed over the n! permutations of (1, ..., n). Define

    d_n(i, j) = a_n(i, j) − (1/n) Σ_{g=1}^{n} a_n(g, j) − (1/n) Σ_{h=1}^{n} a_n(i, h)
                + (1/n²) Σ_{g=1}^{n} Σ_{h=1}^{n} a_n(g, h).

Then S_n is asymptotically normal with mean

    E[S_n] = (1/n) Σ_i Σ_j a_n(i, j)

and variance

    Var(S_n) = (1/(n − 1)) Σ_i Σ_j d_n²(i, j),

provided

    lim_{n→∞} [ (1/n) Σ_i Σ_j d_n^r(i, j) ] / [ (1/n) Σ_i Σ_j d_n²(i, j) ]^{r/2} = 0,   r = 3, 4, ...   (3.3)

For product scores a_n(i, j) = a_n(i) b_n(j), as in a simple linear rank statistic,
d_n(i, j) = (a_n(i) − ā_n)(b_n(j) − b̄_n), where

    ā_n = (1/n) Σ_i a_n(i),   b̄_n = (1/n) Σ_i b_n(i).
3.4. Exercises

Exercise 3.1 (Hájek and Sidak (1967), p. 81). Show that Var(τ) = 2(2n + 5)/(9n(n − 1)).
Hint: Define the Kendall statistics τ_n and τ_{n−1} based on (X_1, ..., X_n) and (X_1, ..., X_{n−1})
respectively. Using the fact that

    R_i = 1 + (1/2) Σ_{j≠i} [1 + sgn(X_i − X_j)],
show that

    τ_n = ((n − 2)/n) τ_{n−1} + (4/(n(n − 1))) (R_n − (n + 1)/2),

from which we obtain the telescopic recursion for h_i = [i(i − 1)]² Var(τ_i) with

    h_i − h_{i−1} = (4/3)(i² − 1).

Hence,

    h_n = Σ_{i=1}^{n} (h_i − h_{i−1}).
Exercise 3.2. Suppose that R_1, ..., R_n are uniformly distributed over the set of n!
permutations of the integers 1, 2, ..., n. Show that the conditions of Theorem 3.5 are
satisfied for the statistic S_n = Σ_{i=1}^{n} i R_i.
Exercise 3.3. Apply the projection method to the sample variance for a random sample
of size n:

    S_n² = (1/(n − 1)) Σ_{i=1}^{n} (X_i − X̄)²
         = \binom{n}{2}^{-1} Σ_{i<j} (X_i − X_j)²/2.
Exercise 3.4.

(a) Show that Var(τ̂) = (4/9) ((n + 1)/n)² (1/(n − 1)). Hence, as n → ∞,

    Var(τ̂)/Var(τ) → 1.

(b) Show that this implies that the Kendall and Spearman statistics are asymptotically
equivalent.
Exercise 3.5. Let X_1, ..., X_{n_1} and Y_1, ..., Y_{n_2} be two independent random samples
from distributions F(x) and G(y), respectively. Let M be the number of pairs (X_i, Y_j)
for which X_i < Y_j and let W represent the sum of the ranks of the Y's in the combined
sample. Show that

    W = M + n_2(n_2 + 1)/2.

Hint: With the Y's arranged in increasing order, for any j,

    Σ_i I(X_i < Y_j) + j = Rank(Y_j).
Exercise 3.6. In Example 3.5, find the projection of U_{n_1,n_2} onto the space of linear rank
statistics for the X's and for the Y's.
Exercise 3.7 (Randles and Wolfe (1979)). Find the mean and variance of the statistic

    S_n = Σ_{i=1}^{n} i² log( R_i/(n + 1) ).
Exercise 3.8. Consider the Spearman Footrule distance between two permutations μ, ν
of length n defined as

    d(μ, ν) = Σ_{i=1}^{n} |μ(i) − ν(i)|.

It was shown in Diaconis and Graham (1977) that when μ, ν are independent and
uniformly distributed over the permutations of the integers 1, 2, ..., n,

    E[d(μ, ν)] = n²/3 + O(n),
    Var[d(μ, ν)] = 2n³/45 + O(n²).

Use Hoeffding's combinatorial central limit theorem with a_n(i, j) = |i − j| to show that
d(μ, ν) is asymptotically normal.
Exercise 3.9. An alternate form of the projection Theorem 3.1 (see van der Vaart
(2007), p. 176) is as follows.

Let R = (R_1, ..., R_N) be the ranks of an i.i.d. sample U_1, ..., U_N from the uniform
distribution on (0, 1). Let

    a(i) = E[φ(U_N^{(i)})],

    S̃_N = N ā_N c̄_N + Σ_{i=1}^{N} (c_i − c̄_N) φ(U_i)

and

    S_N = Σ_{i=1}^{N} c_i E[φ(U_i) | R].

Then the projection of S̃_N onto the space of linear rank statistics is S_N in the sense that

    E[S_N] = E[S̃_N]

and

    Var(S_N − S̃_N) / Var(S̃_N) → 0, as N → ∞.
Use the above result to show that the Wilcoxon two-sample statistic defined in
Section 3.1,

    T_N = (1/N) Σ_{i=m+1}^{N} R_i,

is asymptotically equivalent to

    (1/N) [ −n Σ_{i=1}^{m} F(X_i) + m Σ_{j=1}^{n} F(Y_j) ].
Part II.
4. Smooth Goodness of Fit Tests
Goodness of fit problems have had a long history dating back to Pearson (1900). Such
problems are concerned with testing whether or not a set of observed data emanate from
a specified distribution. For example, suppose we would like to test the hypothesis that
a set of n observations come from a standard normal distribution. Pearson proposed to
first divide the interval (−∞, ∞) into d subintervals and then calculate the statistic
    X² = Σ_{i=1}^{d} (o_i − e_i)²/e_i,
where o_i and e_i represent the observed and expected numbers of observations appearing
in the ith interval. The expected values {e_i} are calculated as

    e_i = n p_i,

where p_i is the probability that a standard normal random variable falls in the ith
interval. The test rejects the null hypothesis that the data come from a standard normal
distribution whenever X² ≥ χ²_{d−1}(α), where χ²_{d−1}(α) represents the upper α critical
value (that is, the 100(1 − α)th percentile) of a chi-square distribution with (d − 1)
degrees of freedom.
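In R, the Pearson statistic may be computed directly. The sketch below uses d = 10
equiprobable cells for testing standard normality of a simulated sample; the sample size
and number of cells are arbitrary choices:

    set.seed(4)
    x <- rnorm(200)                  # data to be tested against N(0, 1)
    breaks <- c(-Inf, qnorm(seq(0.1, 0.9, by = 0.1)), Inf)  # 10 equiprobable cells
    o <- table(cut(x, breaks))       # observed counts o_i
    e <- 200 * 0.1                   # expected counts e_i = n * p_i
    X2 <- sum((o - e)^2 / e)
    pchisq(X2, df = 10 - 1, lower.tail = FALSE)   # p-value on d - 1 degrees of freedom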
Apart from having to specify the number and the choice of subintervals, one of
the drawbacks of the Pearson test is that the alternative hypothesis is vague leaving
the researcher in a quandary if in fact the test rejects the null hypothesis. Similarly,
the usual tests for goodness of fit proposed by Kolmogorov-Smirnov and Cramér von
Mises are omnibus tests that lack power when the alternatives specify departures in
location, scale, skewness, or kurtosis. Neyman (1937), in an attempt to deal with these
issues, reconsidered the problem by embedding it into a more general framework. This is
perhaps the first application of what has become known as exponential tilting whereby
the density specified by the null hypothesis is exponentially tilted to provide a density
under the alternative. Moreover, that transition from the null to the alternative occurs
in a smooth manner.
Consider two discrete distributions w = (w_1, ..., w_L) and w_0 = (w_{01}, ..., w_{0L}) and the
Kullback-Leibler divergence

    D(w||w_0) = Σ_{i=1}^{L} w_i log(w_i/w_{0i}),

where w represents the true distribution. The Kullback-Leibler measure is not a met-
ric since it satisfies neither the symmetry property nor the triangle inequality.
However, it satisfies the Gibbs inequality D(w||w_0) ≥ 0. This follows from the fact that
− log x is a strictly convex function and hence, by Jensen's inequality,

    − Σ w_i log(w_{0i}/w_i) ≥ − log Σ w_i (w_{0i}/w_i) = 0.
Let w_0 = (1/L, ..., 1/L). Minimizing with respect to the {w_i} the Lagrange multiplier
expression

    D(w||w_0) + θ (μ − Σ w_i x_i) + λ (1 − Σ w_i)

yields

    w_i = e^{θ x_i} / Σ e^{θ x_i}
        = e^{θ (x_i − μ̂)} / Σ e^{θ (x_i − μ̂)}.

There is an interpretation for the parameter θ (Efron, 1981) as follows. Since the {w_i}
determine an estimate μ̂, we may interpret μ̂, indexed by θ, as a contender for inclusion
in a confidence interval for μ. Hence, varying θ leads to different estimates as one would
find in a confidence interval.
Suppose now that we fix the {xi } and resample them n times using weights {wi }
so that
P (X = xi ) = wi .
Let n_i be the number of occurrences of x_i with Σ_{i=1}^{L} n_i = n. Then the bootstrap
distribution corresponds to the multinomial distribution

    P(n_1, ..., n_L) = (n!/(n_1! ... n_L!)) Π w_i^{n_i}
                     = (n!/(n_1! ... n_L!)) Π [ e^{θ(x_i − μ̂)} / Σ e^{θ(x_i − μ̂)} ]^{n_i}
                     = (n!/(n_1! ... n_L!)) e^{θ Σ n_i (x_i − μ̂)} / ( Σ e^{θ(x_i − μ̂)} )^{n}
                     = (n!/(n_1! ... n_L!)) e^{θ Σ n_i (x_i − μ̂) − n K*(θ)},

where

    K*(θ) = log Σ e^{θ(x_i − μ̂)}.
Under the uniform distribution w_0, on the other hand,

    P_0(n_1, ..., n_L) = (n!/(n_1! ... n_L!)) Π (1/L)^{n_i}
                       = (n!/(n_1! ... n_L!)) (1/L)^{n},

so that

    P(n_1, ..., n_L) / P_0(n_1, ..., n_L) = e^{θ Σ n_i (x_i − μ̂) − n K(θ)}
                                          = e^{θ n (μ* − μ̂) − n K(θ)},

where

    K(θ) = log [ (1/L) Σ e^{θ(x_i − μ̂)} ]

and

    μ* = Σ (n_i/n) x_i

is the bootstrapped value. Consequently, the bootstrapped distribution of μ* is centered
at the observed mean. These considerations have shown that an exponentially tilted
distribution arises in a natural way in an estimation context.
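The tilted weights are simple to compute. The following R sketch evaluates w_i(θ) and
the mean μ̂(θ) they induce on a small illustrative data set (the data values are arbitrary);
θ = 0 recovers the uniform weights and the sample mean:

    x <- c(1.2, 0.4, 2.7, 1.9, 0.8, 3.1, 1.5)   # illustrative data
    tilt_mean <- function(theta, x) {
      w <- exp(theta * (x - mean(x)))           # unnormalized tilted weights
      w <- w / sum(w)
      sum(w * x)                                # mean under the tilted weights
    }
    sapply(c(-1, -0.5, 0, 0.5, 1), tilt_mean, x = x)  # increasing in theta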
Neyman embedded the null uniform density on (0, 1) into the smooth alternative family

    π(x; θ) = exp{ Σ_{j=1}^{d} θ_j h_j(x) − K(θ) },   0 ≤ x ≤ 1,   (4.1)

where

    h(x) = (h_1(x), ..., h_d(x))′,

K(θ) is the normalizing constant, depending on θ, chosen to make π(x; θ) a proper
density, and the {h_j(x)} consist of orthonormal polynomials that satisfy

    ∫_0^1 h_i(x) dx = 0,                                          (4.2)

    ∫_0^1 h_i(x) h_j(x) dx = δ_ij = { 0, i ≠ j; 1, i = j }.        (4.3)
Table 4.1.: Orthogonal functions for various densities

    Density                                                Orthogonal functions
    Normal: e^{−x²/2}/√(2π)                                Hermite polynomials
    Poisson: e^{−λ} λ^x / x!                               Poisson-Charlier:
                                                           h*_j(x; λ) = sqrt(λ^j/j!) Σ_{t=0}^{j} (−1)^{j−t} \binom{j}{t} λ^{−t} x(x−1)...(x−t+1)
    Exponential: e^{−x}                                    Laguerre:
                                                           h*_j(x) = Σ_{t=0}^{j} \binom{j}{t} (−x)^t/t! = (e^x/j!) (d^j/dx^j)(e^{−x} x^j)
    Gamma: x^{α−1} exp(−x/β) / (Γ(α) β^α)                  generalized Laguerre polynomials
    Hypergeometric: \binom{M}{x}\binom{N−M}{n−x}/\binom{N}{n},   Chebyshev polynomials:
        x = 0, 1, ..., n                                   h_0(x, n) = 1, h_1(x, n) = 1 − 2x/n
Under the model in (4.1), the test for uniformity then becomes a test of

    H_0: θ = 0 vs H_1: θ ≠ 0.                                     (4.4)

For values of θ_j close to 0, we see that π(x; θ) is close to the uniform density.
Hence, the model in (4.1) provides a "smooth" transition from the null hypothesis and,
as we shall later see, leads to tests with optimal properties. We note that under this
formulation, the original nonparametric problem has been cast as a parametric one. The
next theorem specifies the test statistic.
Theorem 4.1. Let X_1, ..., X_n be a random sample from (4.1). The score test for testing
(4.4) rejects whenever

    Σ_{j=1}^{d} U_j² > c_α,

where

    U_j = (1/√n) Σ_{i=1}^{n} h_j(X_i)

and c_α is the upper α critical value of the chi-square distribution with d degrees of freedom.

Proof. The log likelihood is

    l(θ; x) = Σ_{j=1}^{d} θ_j Σ_{i=1}^{n} h_j(x_i) − n K(θ),

and the score function

    U(θ; X) = ∂l(θ; X)/∂θ

has jth component

    ∂l(θ; x)/∂θ_j = √n u_j − n ∂K(θ)/∂θ_j.

Differentiating both sides of

    ∫_0^1 π(x; θ) dx = 1

with respect to θ_j and evaluating the derivative at θ = 0 leads to

    ∂K(θ)/∂θ_j |_{θ=0} = 0,

so that

    ∂l(θ; x)/∂θ_j |_{θ=0} = √n u_j.

Also, in view of (4.3), the (i, j) component of the information matrix I_n(θ) evaluated
at θ = 0 is

    n ∂²K(θ)/∂θ_i ∂θ_j |_{θ=0} = n δ_ij.

It follows that the score test statistic at θ = 0 is

    U(θ; X)′ I_n^{-1}(θ) U(θ; X) = Σ_{j=1}^{d} U_j².

Lemma 2.5 then yields the asymptotic distribution of the score statistic. An alternative
direct proof makes use of the fact that the U_j's are sums of i.i.d. random variables.
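A minimal R implementation of the resulting test, using the first four normalized
(shifted) Legendre polynomials on (0, 1) as the h_j, is sketched below; the choice d = 4
is arbitrary:

    smooth_test <- function(x) {
      n <- length(x)
      h <- cbind(sqrt(3) * (2*x - 1),
                 sqrt(5) * (6*x^2 - 6*x + 1),
                 sqrt(7) * (20*x^3 - 30*x^2 + 12*x - 1),
                 3       * (70*x^4 - 140*x^3 + 90*x^2 - 20*x + 1))
      U <- colSums(h) / sqrt(n)          # U_j = n^{-1/2} sum_i h_j(X_i)
      stat <- sum(U^2)                   # score statistic, ~ chi^2_4 under H0
      c(statistic = stat,
        p.value = pchisq(stat, df = 4, lower.tail = FALSE))
    }
    set.seed(6)
    smooth_test(runif(200))              # uniform data: large p-value expected
    smooth_test(rbeta(200, 2, 2))        # a smooth departure from uniformity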
The change of measure or exponential tilting model introduced in (4.1) has been
used in rare event simulation (Asmussen et al., 2016) as well as in rejection sampling
and importance sampling. We illustrate the latter use in the following example.
Example 4.1. Suppose that X ~ N(0, σ²) and that we wish to estimate by simulation the
tail probability

    p = P(X > c) = E_1[I(X > c)],

where c > 0 and I(.) is the indicator function. This can be done in one of two ways.
In the first case, we may take a random sample of size n and calculate the unbiased
estimator

    p̂ = (1/n) Σ_{i=1}^{n} I(X_i > c),

whose variance is equal to

    p(1 − p)/n = (1/n) { E_1[I(X > c)] − p² }.                    (4.5)
Alternatively, we note that

    p = ∫_c^∞ Λ(x) (1/(√(2π) σ)) exp( −(x − c)²/(2σ²) ) dx
      ≡ E_2[Λ(X) I(X > c)],
where

    Λ(x) = [ (1/(√(2π) σ)) exp(−x²/(2σ²)) ] / [ (1/(√(2π) σ)) exp(−(x − c)²/(2σ²)) ]
         = exp( −c(2x − c)/(2σ²) )

is the likelihood ratio. We may now take a random sample of size n with respect to the
change of measure (1/(√(2π) σ)) exp(−(x − c)²/(2σ²)) and calculate the unbiased estimator

    p̂_c = (1/n) Σ_{i=1}^{n} Λ(X_i) I(X_i > c),

whose variance is

    (1/n) { E_2[Λ(X)² I(X > c)] − p² }.                           (4.6)

Since on the set {X > c}

    Λ(X) ≤ 1,

the inequality

    E_2[Λ(X)² I(X > c)] ≤ E_2[Λ(X) I(X > c)] = E_1[I(X > c)]

shows that the variance (4.6) of the second estimator will not exceed that of the first
in (4.5). See Siegmund (1976) for further discussion on importance sampling and its ap-
plication to the calculation of error probabilities connected to the sequential probability
ratio test.
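The variance reduction can be checked by simulation. In the following R sketch (with
arbitrary choices of σ, c, and n), the second estimator samples from N(c, σ²) and applies
the likelihood ratio Λ:

    set.seed(7)
    sigma <- 1; c0 <- 3; n <- 1e5
    x1 <- rnorm(n, 0, sigma)
    p_naive <- mean(x1 > c0)                       # estimator (4.5)
    x2 <- rnorm(n, c0, sigma)                      # sample under the change of measure
    lam <- exp(-c0 * (2*x2 - c0) / (2*sigma^2))    # Lambda(x)
    p_is <- mean(lam * (x2 > c0))                  # estimator (4.6)
    c(true = pnorm(c0/sigma, lower.tail = FALSE), naive = p_naive, tilted = p_is)
    var(lam * (x2 > c0)) / var(as.numeric(x1 > c0))  # variance reduction factor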
In this book, we shall make use of exponential tilting to describe many common
nonparametric statistics. Setting

    π(x; θ) = exp{θx − K(θ)} g_X(x),

we record in Table 4.2 various examples of the change of measure for different densities
g_X(x), where the latter is determined from π(x; θ) under θ = 0. In all cases, the normal-
izing constant K(θ) is the cumulant or log of the moment generating function computed
under g_X(x).
Table 4.2.: Examples of change of measure distributions

    Density g_X(x)                                      π(x; θ)                                                     K(θ)
    N(μ, σ²)                                            N(μ + θσ², σ²)                                              θ²σ²/2 + θμ
    β exp(−βx), x > 0, β > 0                            (β − θ) exp(−(β − θ)x), θ < β                               −log(1 − θ/β)
    \binom{n}{x} p^x (1 − p)^{n−x}, x = 0, ..., n       \binom{n}{x} p_θ^x (1 − p_θ)^{n−x}, x = 0, ..., n,          n log(1 − p + pe^θ)
                                                            0 < p_θ < 1, p_θ = pe^θ/(1 − p + pe^θ)
    e^{−μ} μ^x/x!, μ > 0, x = 0, 1, ...                 e^{−μ_θ} μ_θ^x/x!, μ_θ = μe^θ, x = 0, 1, ...                μ(e^θ − 1)
    (β^α/Γ(α)) x^{α−1} e^{−βx}, α, β > 0                ((β − θ)^α/Γ(α)) x^{α−1} e^{−(β−θ)x}, β − θ > 0             −α log(1 − θ/β)
    N(μ, Σ)                                             N(μ + Σθ, Σ)                                                θ′Σθ/2 + θ′μ
    (1/(2^{r/2} Γ(r/2))) x^{r/2−1} e^{−x/2}, x, r > 0   ((1/2 − θ)^{r/2}/Γ(r/2)) x^{r/2−1} e^{−(1/2−θ)x}, θ < 1/2   (−r/2) log(1 − 2θ)
    (1 − p)^{x−1} p, x = 1, 2, ...; p > 0               (1 − p_θ)^{x−1} p_θ, x = 1, 2, ...;                         log[ pe^θ/(1 − (1 − p)e^θ) ]
                                                            p_θ = 1 − (1 − p)e^θ > 0, θ < −log(1 − p)

4. Smooth Goodness of Fit Tests
Example 4.2. Suppose that we are given a random sample of size n from a normal
distribution with mean μ and variance σ². We would like to test a null hypothesis on
the mean μ. Under the smooth alternative formulation,

    π(x; θ) = (1/(√(2π) σ)) exp( −(x − μ − θσ²)²/(2σ²) ).

Based on a random sample of size n, the log of the likelihood function as a function of
θ is proportional to

    nθ(x̄_n − μ) − (n/2) σ² θ²,

where x̄_n is the sample mean. Setting the derivative of the log likelihood with respect to θ equal
to 0 shows that the maximum likelihood estimate of θ is

    θ̂ = (x̄_n − μ)/σ².

Consequently, θ̂ represents the standardized shift of the sample mean from the mean μ
specified by the null density g_X(x). The score statistic is given by the derivative of the
log likelihood evaluated at θ = 0:

    n(x̄_n − μ).

Moreover, from the above equation, the information function is given by n K″(0) =
nσ². Hence the Rao score statistic from Theorem 2.7 becomes

    W = n²(X̄_n − μ)(nσ²)^{-1}(X̄_n − μ)                          (4.7)
      = (n/σ²)(X̄_n − μ)².                                        (4.8)

The null hypothesis is rejected for large values of W, which has asymptotically, as n → ∞,
a chi-square distribution with one degree of freedom. The advantage of the Rao score
test statistic (4.7) is that all the derivatives are computed under θ = 0. Asymptotically, it
is equivalent to both the generalized likelihood ratio test and the Wald test, as indicated
in Theorem 2.7. It is seen then that (4.8) is the usual two-sided test statistic for the
mean of a normal distribution. We may generalize the previous sections to the case of
composite hypotheses.
    h_i(x, n) = Σ_{m=0}^{i} (−1)^m \binom{i}{m} \binom{i+m}{m} \binom{x}{m} \binom{n}{m}^{-1},   (4.9)

where 0 ≤ i ≤ n (Ralston (1965), p. 238). There is also an ascending factorial form. For
these polynomials we have the recursion

    h_0(x, n) = 1,   h_1(x, n) = 1 − 2x/n,
    (i + 1)(n − i) h_{i+1}(x, n) = (2i + 1)(n − 2x) h_i(x, n) − i(n + i + 1) h_{i−1}(x, n).
Any function g defined on {0, 1, ..., n} may be expanded as

    g = Σ_{i=0}^{n} g_i ε_i,

where the vector g = (g(0), g(1), ..., g(n))′, ε_i = (h_i(0, n), ..., h_i(n, n))′, and the g_i are
obtained from the relation

    g_i = g′ε_i / ||ε_i||².

Alternatively, for each x,

    g(x) = Σ_{i=0}^{n} g_i h_i(x, n),   x = 0, 1, ..., n.
The connection between the Chebyshev polynomials and the hypergeometric distribution
is given by the following theorem proven in Alvo and Cabilio (2000).
Theorem 4.2. The expected value of a function of a hypergeometric variable with pa-
rameters (M, N, n) is equal to a linear combination of the first n Chebyshev polynomials
in the parameters M, N:

    E[g(X)] = Σ_{i=0}^{n} g_i h_i(M, N).
Proof. The proof uses the representation in (4.9) and proceeds by showing that

    Σ_{x=0}^{n} h_i(x, n) p(x; M, N) = h_i(M, N)   for all i = 0, 1, ..., n, and M = 0, 1, ..., N.

Consequently,

    E[g(X)] = Σ_{x=0}^{n} g(x) p(x; M, N) = Σ_{i=0}^{n} g_i h_i(M, N),   (4.10)

for x = 0, 1, ..., n and M = 0, 1, ..., N.
Alvo and Cabilio (2000) discuss similar results for the binomial distribution.
For the binomial distribution, for example, the natural parameter and cumulant are

    θ = log( ρ/(1 − ρ) ),   K(θ) = n log(1 + e^θ).

Here our interest would be in testing the null hypothesis that ρ = 0.5, corresponding
to θ = 0.
Consider the model

    f(x; θ, β) = exp{ Σ_{i=1}^{k} θ_i h_i(x; β) − K(θ, β) } g(x; β),

where

    ∂K(θ, β)/∂θ_i = E_θ[h_i(X; β)]

and

    ∂²K(θ, β)/∂θ_i ∂θ_j = Cov_θ[h_i(X; β), h_j(X; β)].

The score test statistic is then given by

    S(β̂) = U′(β̂) Σ̂^{-1} U(β̂),

where β̂ is the maximum likelihood estimate of β and U(β̂) has jth element
Σ_{i=1}^{n} h_j(X_i; β̂). Here Σ̂ is the estimated asymptotic covariance matrix of U(β̂).
We refer the reader to (Rayner et al. (2009b), p. 100) for further details.
Example 4.5. Suppose that we have a random sample X_1, ..., X_n from a normal distri-
bution with mean μ and variance σ²; set β = (μ, σ²)′. A smooth test for the hypothesis

    H_0: μ = μ_0

against

    H_1: μ ≠ μ_0

can be constructed using the normalized Hermite polynomials from Table 4.1, which
satisfy

    ∫_{−∞}^{∞} h_i(x) e^{−x²/2} dx = 0,
    ∫_{−∞}^{∞} h_i(x) h_j(x) e^{−x²/2} dx = √(2π) δ_ij.
The Hermite polynomials with respect to the distribution of X are given by h_i((x − μ)/σ).
The maximum likelihood estimates of β are

    β̂ = ( X̄_n, Σ (X_i − X̄_n)²/n )′.

We note that V̂_1 = V̂_2 = 0 at the maximum likelihood estimates, so that the score test
statistic becomes

    Σ_{j=3}^{k} V̂_j²,

where

    V̂_j = (1/√n) Σ_{i=1}^{n} h_j( (X_i − X̄_n) / sqrt( Σ (X_i − X̄_n)²/n ) ).
Consider next a discrete random vector X taking values x_1, ..., x_m and the smooth model

    π(x_j; θ) = exp{θ′x_j − K(θ)} p_j,   j = 1, ..., m,            (4.11)

where x_j is the jth value of the k-dimensional random vector X and p = (p_j) denotes
the vector of probabilities under the null distribution when θ = θ_0. Here K(θ) is a
normalizing constant for which

    Σ_j π(x_j; θ) = 1.
The normalizing constant satisfies ∂K(θ)/∂θ = E[X], where the expectation is taken with
respect to the model (4.11). As we shall see in Chapter 5, this particular situation arises
often when dealing with the nonparametric randomized block design. Define

    π(θ) = (π(x_1; θ), ..., π(x_m; θ))′.

The test of interest is

    H_0: θ = 0 vs H_1: θ ≠ 0.

Letting N denote a multinomial random variable with parameters (n, π(θ)), we see
that the log likelihood as a function of θ is, apart from a constant, proportional to

    Σ_{j=1}^{m} n_j log π(x_j; θ) = Σ_{j=1}^{m} n_j (θ′x_j − K(θ))
                                  = θ′ Σ_{j=1}^{m} n_j x_j − n K(θ).

The score function is

    U(θ; X) = Σ_{j=1}^{m} (N_j / π_j(θ)) ∂π_j(θ)/∂θ,

which, evaluated at θ = 0, equals

    U(0; X) = T(N − np),

where T = (x_1, ..., x_m) is the matrix with columns x_j. With Σ denoting the asymptotic
covariance matrix of T(N − np)/√n, it follows that

    (1/n) [T(N − np)]′ Σ^{-1} [T(N − np)] = (1/n) (N − np)′ T′ Σ^{-1} T (N − np) →_L χ²_r,   (4.14)

where r = rank(T′ Σ^{-1} T).
Example 4.7 (Pearson's Goodness of Fit Statistic (Rayner et al. (2009b), p. 68)). We
shall show that the Pearson goodness of fit statistic

    Σ_{j=1}^{m} (N_j − n p_j)²/(n p_j) →_L χ²_{m−1},

where Σ p_j = 1, may be obtained using the smooth model formulation.
Define the random vector x*_j as

    x*_j = (x_j′, 1)′

and note that

    K″(0) = Σ_j x*_j x*_j′ p_j
          = T*′ diag(p_j) T*                                       (4.15)
          = I_m.
Since

    1_m′ diag(p_j) 1_m = Σ p_j = 1,
    1_m′ diag(p_j) T = Σ x_j′ p_j = 0′,

it follows that

    T′ diag(p_j) T = I_{m−1}.

Now from (4.15),

    diag(p_j^{-1}) = T* T*′
                   = T T′ + 1_m 1_m′.

Hence, using Σ_j (N_j − np_j) = 0,

    (1/n) (N − np)′ T T′ (N − np) = (1/n) (N − np)′ [ diag(p_j^{-1}) − 1_m 1_m′ ] (N − np)
        = Σ_{j=1}^{m} (N_j − np_j)²/(np_j) − (1/n) [ Σ_{j=1}^{m} (N_j − np_j) ]²
        = Σ_{j=1}^{m} (N_j − np_j)²/(np_j).
The usual properties of a distance function between two rankings μ and ν are: (1)
reflexivity: d(μ, μ) = 0; (2) positivity: d(μ, ν) > 0 if μ ≠ ν; and (3) symmetry:
d(μ, ν) = d(ν, μ). For rankings, we often require that the distance, apart from having
these usual properties, must be right invariant:

    d(μ ∘ σ, ν ∘ σ) = d(μ, ν)   for every permutation σ.

The requirement of right invariance ensures that a relabeling of the objects has no effect on
the distance. If a distance function satisfies the triangle inequality d(μ, ν) ≤ d(μ, σ) +
d(σ, ν), the distance is said to be a metric. There are several examples of distance
functions that have been proposed in the literature. Here are a few:
Spearman:

    d_S(μ, ν) = (1/2) Σ_{i=1}^{t} (μ(i) − ν(i))²                    (4.16)

Kendall:

    d_K(μ, ν) = Σ_{i<j} { 1 − sgn(μ(j) − μ(i)) sgn(ν(j) − ν(i)) }   (4.17)

Hamming:

    d_H(μ, ν) = Σ_{i=1}^{t} I[μ(i) ≠ ν(i)],                          (4.18)

where I(.) is the indicator function taking values 1 or 0 depending on whether the
statement in brackets holds or not.

Spearman Footrule:

    d_F(μ, ν) = Σ_{i=1}^{t} |μ(i) − ν(i)|                            (4.19)

Cayley:

    d_C(μ, ν) = t − #cycles in ν ∘ μ^{-1},

or equivalently, the minimum number of transpositions needed to transform μ into ν. Here,
μ^{-1} = (μ^{-1}(1), ..., μ^{-1}(t)) denotes the inverse permutation that displays the objects
receiving a specific rank.
Note that the Spearman Footrule, Kendall, Hamming, and Cayley distances are
metrics, but the Spearman distance, like the squared Euclidean distance, is not, since it
does not satisfy the triangle inequality. We shall nonetheless for convenience
refer to it as a distance function in this book. The Kendall distance reflects the number
of "discordant" pairs, whereas the Hamming distance counts the number of "mismatches."
The Hamming distance has found uses in coding theory.
Let M = ( d(μ_i, μ_j) ) denote the matrix of all pairwise distances. If d is right
invariant, then it follows that there exists a constant c > 0 for which

    M 1 = (c t!) 1,

where 1 = (1, 1, ..., 1)′ is of dimension t!. Hence, c is equal to the average distance. It
is straightforward to show that for the Spearman and Kendall distances

    c_S = t(t² − 1)/12,   c_K = t(t − 1)/2.

Turning attention to the Hamming distance, we note that if e = (1, 2, ..., t)′ denotes the
identity ranking, then by right invariance c_H is the average of d_H(μ, e) over all μ, and
hence c_H = t − 1.
Example 4.8. Suppose that t = 3, and that the complete rankings are denoted by
μ1 = (1, 2, 3) , μ2 = (1, 3, 2) , μ3 = (2, 1, 3) , μ4 = (2, 3, 1) , μ5 = (3, 1, 2) , μ6 = (3, 2, 1)
Using the above order of the permutations, we may write the matrix M of pairwise
Spearman, Kendall, Hamming, and Footrule distances respectively as
    M_S =
        0 1 1 3 3 4
        1 0 3 1 4 3
        1 3 0 4 1 3
        3 1 4 0 3 1
        3 4 1 3 0 1
        4 3 3 1 1 0

    M_K =
        0 2 2 4 4 6
        2 0 4 2 6 4
        2 4 0 6 2 4
        4 2 6 0 4 2
        4 6 2 4 0 2
        6 4 4 2 2 0

    M_H =
        0 2 2 3 3 2
        2 0 3 2 2 3
        2 3 0 2 2 3
        3 2 2 0 3 2
        3 2 2 3 0 2
        2 3 3 2 2 0

    M_F =
        0 2 2 4 4 4
        2 0 4 2 4 4
        2 4 0 4 2 4
        4 2 4 0 4 2
        4 4 2 4 0 2
        4 4 4 2 2 0
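These matrices are easily reproduced in R; the sketch below computes M_S and M_K
directly from the definitions (4.16) and (4.17):

    perms <- matrix(c(1,2,3, 1,3,2, 2,1,3, 2,3,1, 3,1,2, 3,2,1),
                    ncol = 3, byrow = TRUE)
    dS <- function(mu, nu) sum((mu - nu)^2) / 2
    dK <- function(mu, nu) {
      s <- 0
      for (i in 1:2) for (j in (i + 1):3)
        s <- s + (1 - sign(mu[j] - mu[i]) * sign(nu[j] - nu[i]))
      s
    }
    MS <- outer(1:6, 1:6, Vectorize(function(a, b) dS(perms[a, ], perms[b, ])))
    MK <- outer(1:6, 1:6, Vectorize(function(a, b) dK(perms[a, ], perms[b, ])))
    MS; MK   # agree with the matrices displayed above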
Corresponding to the distances, one may also define similarity measures between
rankings; for example:

Hamming:

    A_H(μ, ν) = Σ_{i=1}^{t} Σ_{j=1}^{t} ( I[μ(i) = j] − 1/t ) ( I[ν(i) = j] − 1/t )   (4.23)

Footrule:

    A_F(μ, ν) = Σ_{i=1}^{t} Σ_{j=1}^{t} ( I[μ(i) ≤ j] − j/t ) ( I[ν(i) ≤ j] − j/t )   (4.24)

The similarity measures may also be interpreted geometrically as inner products, which
sets the groundwork for defining correlation (Alvo and Yu (2014)).
It is reasonable to assume that in a homogeneous population of judges, most of
the judges will have rankings close to a modal ranking μ_0. According to this frame-
work, Diaconis (1988a) developed a class of distance-based models over the set of all t!
rankings P:

    π(μ | λ, μ_0) = e^{−λ d(μ, μ_0)} / C(λ),   μ ∈ P,              (4.25)
where C(λ) = Σ_{μ∈P} e^{−λ d(μ, μ_0)} is the normalizing constant and λ ≥ 0 is a dispersion
parameter. The maximum likelihood estimate λ̂ solves

    (1/n) Σ_{k=1}^{n} d(μ_k, μ_0) = E_{λ̂, μ_0}[d(μ, μ_0)],         (4.26)

which equates the observed mean distance with the expected distance calculated under
the distance-based model in (4.25).

The MLE can be found numerically because the observed mean distance is a constant
and the expected distance is a strictly decreasing function of λ. For ease of solution,
we re-parametrize λ by φ = e^{−λ}. The range of φ lies in (0, 1], and the value of
φ̂ can be obtained using the method of bisection. Critchlow (1985) suggested applying
the method with 15 iterations, which yields an error of less than 2^{−15}. Also, a central
limit theorem holds for the MLE λ̂, as shown in Marden (1995).
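The following R sketch carries out this estimation for the Kendall distance when t is
small enough to enumerate all t! rankings; it counts discordant pairs (one half of (4.17)),
relies on the gtools package for the enumeration, and uses uniroot in place of an explicit
bisection on (0, 1):

    dK_pairs <- function(mu, nu) {          # number of discordant pairs
      t <- length(mu); s <- 0
      for (i in 1:(t - 1)) for (j in (i + 1):t)
        s <- s + (sign(mu[j] - mu[i]) * sign(nu[j] - nu[i]) < 0)
      s
    }
    mallows_phi <- function(rankings, mu0) {
      t <- length(mu0)
      all_perms <- gtools::permutations(t, t)   # all t! rankings
      d_all <- apply(all_perms, 1, dK_pairs, nu = mu0)
      dbar <- mean(apply(rankings, 1, dK_pairs, nu = mu0))   # observed mean distance
      expected <- function(phi) sum(d_all * phi^d_all) / sum(phi^d_all)
      uniroot(function(phi) expected(phi) - dbar, c(1e-8, 1 - 1e-8))$root
    }
    # Illustration: rankings concentrated near mu0 = (1,2,3,4) give a small phi-hat.
    data <- rbind(c(1,2,3,4), c(1,2,4,3), c(2,1,3,4), c(1,2,3,4), c(1,3,2,4))
    mallows_phi(data, c(1, 2, 3, 4))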
If the modal ranking μ_0 is unknown, it can be estimated by the MLE μ̂_0, which
minimizes the sum of the distances over P, that is:

    μ̂_0 = argmin_{μ_0 ∈ P} Σ_{k=1}^{n} d(μ_k, μ_0).                (4.27)

For large values of t, a global search for the MLE μ̂_0 is not practical because
the number of possible rankings is too large. Instead, as suggested in Busse et al.
(2007), a local search algorithm may be used: iteratively search for the modal ranking
with the smallest sum of distances Σ_{k=1}^{n} d(μ_k, μ_0) over μ_0 ∈ Π^{(m)}, where Π^{(m)}
is the set of all rankings having a Cayley distance of 0 or 1 to the optimal modal ranking
found in the mth iteration:

    μ̂_0^{(m+1)} = argmin_{μ_0 ∈ Π^{(m)}} Σ_{k=1}^{n} d(μ_k, μ_0).

A reasonable choice of the initial ranking μ̂_0^{(0)} can be formed by ordering the mean ranks.
Recently, Yu and Xu (2018) found in their simulations that this method may leave
μ̂_0^{(m+1)} stuck at a local minimum, unable to reach the global minimum. They proposed
instead simulated annealing, a faster algorithm for finding the global solution of the
minimization problem in (4.27). Their simulation results revealed that the simulated
annealing algorithm consistently outperforms the local search algorithm even when the
number of objects t becomes large. The local search algorithm generally performs
satisfactorily for small t, but its performance deteriorates heavily when t gets large, say
t ≥ 50.
Distance-based models can handle partially ranked data in several ways, with some
modifications in the distance measures. Beckett (1993) estimated the model parameters
using the EM algorithm. On the other hand, Adkins and Fligner (1998) offered a non-
iterative maximum likelihood estimation procedure for Mallows' φ-model without using
the EM algorithm. Critchlow (1985) suggested replacing the distance metric d by the
Hausdorff metric d*. The Hausdorff metric between two partial rankings μ* and σ*
equals

    d*(μ*, σ*) = max[ max_{μ∈μ*} min_{σ∈σ*} d(μ, σ), max_{σ∈σ*} min_{μ∈μ*} d(μ, σ) ].   (4.28)
Suppose now that the distance decomposes into a sum of stagewise components,

    d(μ, σ) = Σ_{i=1}^{t−1} d_i(μ, σ),                              (4.29)

where the d_i(μ, σ)'s are independent. Note that the Kendall distance and the Cayley
distance can be decomposed in this form. Fligner and Verducci (1986) developed two new
classes of ranking models, called φ-component models and cyclic structure models, based
on this decomposition.
Fligner and Verducci (1986) showed that the Kendall distance satisfies (4.29):

    d_K(μ, μ_0) = Σ_{i=1}^{t−1} V_i,                                (4.30)

where

    V_i = Σ_{j=i+1}^{t} I{ [μ(μ_0^{-1}(i)) − μ(μ_0^{-1}(j))] > 0 }.   (4.31)
We note that V_j has the uniform distribution on the integers 0, 1, ..., t − j (see Feller
(1968), p. 257). Here V_1 represents the number of adjacent transpositions required to
place the best object in μ_0 in the first position; this item is then removed from both μ
and μ_0, and V_2 is the number of adjacent transpositions required to place the best
remaining object in μ_0 in the first position of the remaining items, and so on. Therefore,
the ranking can be described in t − 1 stages, V_1 to V_{t−1}, where V_i = m can be
interpreted as m mistakes made at stage i.
By applying a separate dispersion parameter λ_i at stage i, Mallows' φ-model can be
extended to:

    π(μ | λ, μ_0) = e^{ −Σ_{i=1}^{t−1} λ_i V_i } / C_K(λ),          (4.32)

where λ = {λ_i, i = 1, ..., t − 1} and the normalizing constant C_K(λ) is equal to

    Π_{i=1}^{t−1} (1 − e^{−(t−i+1)λ_i}) / (1 − e^{−λ_i}).            (4.33)
Based on a ranking data set {μ_k, k = 1, ..., n} and a given modal ranking μ_0, the
maximum likelihood estimates λ̂_i, i = 1, 2, ..., t − 1, can be found by solving the equations

    (1/n) Σ_{k=1}^{n} V_{k,i} = e^{−λ̂_i}/(1 − e^{−λ̂_i}) − (t − i + 1) e^{−(t−i+1)λ̂_i}/(1 − e^{−(t−i+1)λ̂_i}),   (4.34)

where

    V_{k,i} = Σ_{j=i+1}^{t} I{ [μ_k(μ_0^{-1}(i)) − μ_k(μ_0^{-1}(j))] > 0 }.   (4.35)

The left- and right-hand sides of (4.34) can be interpreted as the observed mean and
theoretical mean of V_i, respectively.
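Since the right-hand side of (4.34) is strictly decreasing in λ_i, each equation may be
solved numerically; a sketch in R, with an illustrative observed mean, follows:

    rhs <- function(lam, t, i)                 # right-hand side of (4.34)
      exp(-lam) / (1 - exp(-lam)) -
        (t - i + 1) * exp(-(t - i + 1) * lam) / (1 - exp(-(t - i + 1) * lam))
    lambda_hat <- function(vbar, t, i)
      uniroot(function(l) rhs(l, t, i) - vbar, c(1e-4, 50))$root
    lambda_hat(vbar = 0.8, t = 5, i = 1)       # illustrative observed mean of V_1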
The extension of distance-based models to t − 1 parameters allows more flexibility
in the model but, unfortunately, the symmetry property of distance is lost. Notice here
that the so-called "distance" in φ-component models can be expressed as

    Σ_{i<j} λ_i I{ [μ(μ_0^{-1}(i)) − μ(μ_0^{-1}(j))] > 0 },          (4.36)

which is obviously not symmetric, and hence it is not a proper distance measure. For
example, in the φ-component model, take μ = (2, 3, 4, 1) and μ_0 = (4, 3, 1, 2); inter-
changing the roles of μ and μ_0 in (4.36) changes its value, so the symmetry property of
distance is not satisfied. Lee and Yu (2012) and Qian and Yu (2018) introduced new
weighted distance measures which retain the properties of a distance and also allow
different weights for different ranks.
The Cayley distance admits a similar stagewise decomposition:

    d_C(μ, μ_0) = Σ_{i=1}^{t−1} X_i(μ, μ_0),                         (4.37)
will then be d_C(μ, μ_0), and it can be decomposed as the sum of the costs of opening
lockers i = 1, 2, ..., t − 1, where the cost of opening locker i equals X_i(μ, μ_0).
If we relax the assumption that the costs of breaking every locker are equal, the total
cost becomes

    Σ_{i=1}^{t−1} θ_i X_i(μ, μ_0),                                   (4.38)

where θ_i is the cost of opening locker i. This "total cost" can be interpreted as a
weighted version of Cayley's distance. Similar to the extension of Mallows' φ-models to
φ-component models, Fligner and Verducci (1986) developed the cyclic structure models
using the weighted Cayley distance. Under this model assumption, the probability of
observing a ranking μ is

    π(μ | θ, μ_0) = e^{ −Σ_{i=1}^{t−1} θ_i X_i(μ, μ_0) } / C_C(θ),   (4.39)

where the normalizing constant C_C(θ) equals

    Π_{i=1}^{t−1} { 1 + (t − i) e^{−θ_i} }.                           (4.40)
For a ranking data set {μ_k, k = 1, ..., n} with a given modal ranking μ_0, the MLEs
θ̂_i, i = 1, 2, ..., t − 1, can be found from the equation

    θ̂_i = log(t − i) − log( X̄_i/(1 − X̄_i) ),                       (4.41)

where

    X̄_i = Σ_{k=1}^{n} X_i(μ_k, μ_0) / n.                             (4.42)
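In R, (4.41) is a one-line computation; the value of X̄_i used below is illustrative:

    theta_hat <- function(xbar, t, i) log(t - i) - log(xbar / (1 - xbar))
    theta_hat(xbar = 0.3, t = 5, i = 1)   # MLE at an illustrative mean of X_i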
Turning to the test of independence in an r × c contingency table with cell probabilities
{p_ij}, consider the smooth model

    π(i, j; θ) = exp{ Σ_{u=1}^{k_1} Σ_{v=1}^{k_2} θ_uv g_u(i) h_v(j) − K(θ) } p_i. p_.j,   (4.43)

where

    Σ_{i=1}^{r} Σ_{j=1}^{c} p_ij = Σ_{i=1}^{r} p_i. = Σ_{j=1}^{c} p_.j = 1.

The {g_u(i)} are orthonormal functions on the marginal row probabilities and the {h_v(j)}
are orthonormal functions on the marginal column probabilities. The test for indepen-
dence is then a test of

    H_0: θ = 0 vs H_1: θ ≠ 0,

where θ = (θ_11, ..., θ_{1k_2}, ..., θ_{k_1 1}, ..., θ_{k_1 k_2})′.
Set

    V̂_uv = Σ_{i=1}^{r} Σ_{j=1}^{c} N_ij ĝ_u(i) ĥ_v(j) / √n,

where Σ_{i=1}^{r} Σ_{j=1}^{c} N_ij = n. Here, {ĝ_u(i)} are the set of polynomials orthonormal
with respect to {p̂_i.}, where p̂_i. = Σ_j N_ij/n. Similarly, {ĥ_v(j)} are the set of polynomials
orthonormal with respect to {p̂_.j}, where p̂_.j = Σ_i N_ij/n. The following theorem, proven
in (Rayner et al. (2009b)), shows that we may obtain the usual test statistic as a
consequence of the smooth model (4.43).
Theorem 4.3. The score statistic for testing H_0 vs H_1 is given by Σ_{u=1}^{k_1} Σ_{v=1}^{k_2} V̂_uv²,
where, under the null hypothesis, the components V̂_uv are asymptotically i.i.d. standard
normal variables.

When k_1 = (r − 1), k_2 = (c − 1), the test statistic is the usual Pearson statistic.

Chapter Notes
Chapter Notes
1. Smooth tests for goodness of fit were introduced by Neyman (1937) when no nui-
sance parameters were involved. Since a probability integral transformation can
transform a distribution to a uniform, the orthogonal polynomials were taken to
be those of Legendre. The tests developed were locally uniformly most powerful,
symmetric, and unbiased. A good introduction along with several references are
given in Rayner et al. (2009b). See also Rayner et al. (2009a) and Rayner et al.
(2009b) for generalizations and extensions of smooth tests of fit including smooth
tests in randomized blocks with ordered alternatives.
2. For the discrete uniform distribution on the integers 1, 2, ..., k, the second-order
orthonormal polynomial is

    g_2(j) = sqrt( 180/((k² − 1)(k² − 4)) ) [ (j − (k+1)/2)² − (k² − 1)/12 ].

Higher order components may be obtained using the recurrence relations as de-
scribed in (Rayner et al. (2009b), p. 243). Additional polynomials may be obtained
from the usual three-term recursion formulas (Kendall and Stuart, 1979).
3. We note that the smooth testing approach here leads to a study of global properties
of the data. This is to be contrasted with the approach in Chapter 7, whereby we
will consider smooth tests that incorporate specific score functions such as those
of Spearman and Kendall. In those instances, the vector parameter θ places a
weighting on the components of the score function. That approach enables us to
study local properties more precisely.
4.6. Exercises
Exercise 4.1. Prove Theorem 4.2.
Exercise 4.2. Suppose that we are given the smooth binomial distribution given in
Table 4.2. Find the score statistic to test the hypothesis that θ = 0.
Exercise 4.3. Repeat Exercise 4.2 using a sample of size n from the smooth Poisson
distribution given in Table 4.2 and test the hypothesis that θ = 0.
5. One-Sample and Two-Sample
Problems
In this chapter we consider several one- and two-sample problems in nonparametric
statistics. Our approach will have a common thread. We begin by embedding the
nonparametric problem into a parametric paradigm. This is then followed by deriving
the score test statistic and finding its asymptotic distribution. The construction of
the parametric paradigm often involves the use of composite likelihood. It will then
be necessary to rely on the use of either linear rank statistics or U -statistics in order
to determine the asymptotic distribution of the test statistic. We shall see that the
parametric paradigm provides new insights into well-known problems. Starting with
the sign test, we show that the parametric paradigm deals easily with the case of ties.
We then proceed with the Wilcoxon signed rank statistic and the Wilcoxon rank sum
statistic for the two-sample problem.
Under the parametric paradigm, we first define a score function sensitive to changes
in the median. Let Y = sgn(X − M_0) and define the change of measure by

    π(y; θ) = exp{θy − K(θ)} g(y),

where g(y) represents the null probability distribution of Y. Let g(1) = p_+, g(0) = p_0,
and g(−1) = p_− such that

    p_+ + p_0 + p_− = 1.

The normalizing constant satisfies

    p_+ e^θ + p_− e^{−θ} + p_0 = e^{K(θ)}.
The hypotheses of interest are

    H_0: π(1; θ) = π(−1; θ)

versus

    H_1: π(1; θ) > π(−1; θ),

or equivalently,

    H_0: θ = 0 versus H_1: θ > 0.
In a sample of size n, let n_+, n_−, n_0 denote the observed numbers of cases where
y = 1, −1, 0, respectively. Note that under H_0, the score function is given by

    U(θ; X) = n_+ − n_−

and the Fisher information is I = n K″(0) = n(1 − p̂_0) = n − n_0. The score test statistic is

    S_sign = [U(θ; X)]²/I = (n_+ − n_−)²/(n − n_0)
           = 4(n_+ + n_0/2 − n/2)²/(n − n_0) = 4(n_+ − (n − n_0)/2)²/(n − n_0).   (5.3)

Consequently, large values of S_sign lead to rejection of the null hypothesis H_0. We see
then that under H_0,

    S_sign →_L χ²_1

as n → ∞.
Remark 5.1. We note that the sign test takes into account situations where ties (i.e.,
X_i = M_0) are possible, and the score function leads naturally to the statistic n_+ + n_0/2,
often suggested without justification in the literature. Note that the last expression of
the score test statistic S_sign in (5.3) supports the usual treatment of ties, namely to
reduce the sample size by discarding the tied observations.
Remark 5.2. In the case of no ties (i.e., n_0 = p_0 = 0), the score test statistic S_sign
depends only on n_+, which has a binomial distribution with success probability p_+, i.e.,
n_+ ~ Bin(n, p_+). Under H_0, we have p_+ = 0.5 and hence n_+ ~ Bin(n, 0.5). For
H_1: M > M_0, we have p_+ > 0.5 and hence large values of n_+ will lead to rejection of H_0.
The p-value of the sign test is then Pr(B ≥ n*_+), where B ~ Bin(n, 0.5) and n*_+ is the
observed value of n_+. Similarly, for H_1: M < M_0, the p-value is Pr(B ≤ n*_+), and for
H_1: M ≠ M_0, the p-value is 2 × min{Pr(B ≥ n*_+), Pr(B ≤ n*_+)}.
Example 5.1. The weekly sales of new mobile phones at a mobile phone shop in a mall
were collected over the past 12 weeks; in 7 of the 12 weeks the number of phones sold
exceeded 50 units. Last year, the median weekly sales was 50 units. Is there sufficient
evidence to conclude that median sales this year are higher than last year? Test the
hypothesis at significance level α = 0.05.

Solution. Let M be the population median weekly sales this year. The hypotheses are

    H_0: M = 50 versus H_1: M > 50.

Here n = 12 and n_+ = 7. Set B ~ Bin(12, 0.5). Then, the p-value for the test is given by

    p-value = Pr(B ≥ 7) = Σ_{i=7}^{12} \binom{12}{i} 0.5^{12} = 0.3872.

Since the p-value is greater than 0.05, we do not have enough evidence to reject H_0 at
the 5% level of significance. Thus, there are no grounds upon which to conclude that
the median sales are now higher than 50 units per week.
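In R, the exact p-value may be obtained from the binomial distribution directly:

    n <- 12; n_plus <- 7
    pbinom(n_plus - 1, n, 0.5, lower.tail = FALSE)            # P(B >= 7) = 0.3872
    binom.test(n_plus, n, p = 0.5, alternative = "greater")$p.value   # same value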
Remark 5.3. In the presence of ties, it is easy to show that under H_0, n_+ + n_0/2 has
mean n/2 and variance (n/4)(1 − p_0), but it no longer has a binomial distribution. When
the sample size n is large, we can apply a normal approximation to its null distribution.
Applying the normal approximation of Remark 5.3 to the data of Example 5.2 (a test of
whether the median sodium content per serving exceeds 75 mg, based on a sample of
n = 40 servings) gives a p-value which is close to the exact p-value 0.040. Since the
p-value is less than 0.05, we conclude that the median amount of sodium per serving is
greater than 75 mg at the 5% level of significance.

In R, the function pnorm(x, μ, σ) calculates Pr(X ≤ x) where X ~ N(μ, σ²).
Remark 5.4. Since only the signs of {X_i − M_0}, and not their magnitudes, are used, the
sign test has the advantage that it can be applied when only the signs are available. The
test is also robust against outliers, since only the sign of an outlier matters, no matter
how far it lies from M_0. When either p_0 = 0 or n_0 = 0, we obtain the usual sign test
(Hájek and Sidak (1967), p. 110).
Remark 5.5 (Paired Comparisons). There are several applications of the sign test for
paired comparisons whereby the null hypothesis is that the distribution of the difference
Z = Y − X is symmetric about 0. As a special case, we can consider the shift model
where Z = (Y − Δ) − X so that the treatment adds a shift of value Δ to the control.
The asymptotic properties of the sign test are discussed in several textbooks. See, for
example, Chapter 14 of van der Vaart (2007) and Chapter 4 of Lehmann (1975).
(b) Let X_(1) < X_(2) < ... < X_(n) be the order statistics. We may show that

    Pr(X_(c_α) < M < X_(n+1−c_α)) = 1 − α,

where, by definition, c_α is chosen such that

    Pr(c_α ≤ B* ≤ n − c_α) = 1 − α

and B* ~ Bin(n, 1/2) is the number of observations falling below the median M. Indeed,
M < X_(c_α) occurs exactly when fewer than c_α observations fall below M, that is, when
B* ≤ c_α − 1. Similarly, M > X_(n+1−c_α) occurs exactly when B* ≥ n + 1 − c_α. Hence

    Pr(X_(c_α) < M < X_(n+1−c_α)) = Pr(c_α ≤ B* ≤ n − c_α) = 1 − α.

For large n, we may use the normal approximation

    c_α ≈ n/2 − z_{α/2} (n/4)^{1/2},

since we have

    Pr( −z_{α/2} ≤ (B* − n/2)/(n/4)^{1/2} ≤ z_{α/2} ) = 1 − α
    ⟺ Pr( n/2 − z_{α/2}(n/4)^{1/2} ≤ B* ≤ n/2 + z_{α/2}(n/4)^{1/2} ) = 1 − α.
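The following R sketch computes such an interval from a simulated sample (the data
and α are illustrative), using qbinom for c_α:

    set.seed(13)
    x <- sort(rexp(25, rate = 0.5))        # illustrative sample; true median ~ 1.39
    n <- length(x); alpha <- 0.05
    c_alpha <- qbinom(alpha / 2, n, 0.5)   # approximately n/2 - z * sqrt(n/4)
    c(x[c_alpha], x[n + 1 - c_alpha])      # realized confidence interval for M
    1 - 2 * pbinom(c_alpha - 1, n, 0.5)    # attained coverage, at least 1 - alpha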
Consider testing

    H_0: μ = μ_0 vs H_1: μ > μ_0

by means of the central limit theorem (CLT) test, which rejects for large values of

    Z = (X̄ − μ_0)/(σ/√n).

For our discussion, we shall only consider the large sample case with σ known. The
probability of committing a Type I error should be at least close to the significance level
α. Though bias may occur when using an approximation, the normal approximation for
both the CLT and sign tests is considered to be quite good for large samples. Therefore,
the stated probability of committing a Type I error will be essentially correct.
To compare the power of the two tests, we must consider the population distribution
under H1 . Referring to Example 5.2, let us assume the true median of sodium content
is 75.8 mg with σ = 2.5 and the sample is selected from a normal population. Then

    power of CLT test = Pr( (X̄ − 75)/(2.5/√40) ≥ 1.645 | μ = 75.8 )
                      = Pr( (X̄ − μ)/(2.5/√40) ≥ 1.645 − (μ − 75)/(2.5/√40) | μ = 75.8 )
                      = 1 − Φ( 1.645 − (75.8 − 75)/(2.5/√40) ) = 0.65.

For the sign test, if μ = 75.8, we have p ≡ Pr(X > 75) = 0.626 and

    power of sign test = Pr( (B − 20)/√(0.25 × 40) ≥ 1.645 | B ~ Bin(40, 0.626) )
                       = Pr( (B − 40p)/√(40p(1−p)) ≥ 1.645 √(0.25/(p(1−p))) − (40p − 20)/√(40p(1−p)) )
                       = 1 − Φ( 1.645 √(0.25/(p(1−p))) − (40p − 20)/√(40p(1−p)) ) = 0.48.

In this example, the CLT test is preferred since its power is greater.
Remark 5.9. When samples are taken from normal populations with known variance,
the CLT test has the greatest power among all tests (it is uniformly most powerful); it is
the test to use when sampling from a normal population. For nonnormal populations,
this is not the case: the sign test will have higher power than the CLT test for heavy-
tailed distributions, such as the Cauchy or Laplace distributions. For example, if the
true distribution is Laplace, the power of the sign test is 0.76.
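The two power calculations above are easily reproduced in R:

    n <- 40; mu <- 75.8; sigma <- 2.5
    1 - pnorm(1.645 - (mu - 75) / (sigma / sqrt(n)))          # CLT test: 0.65
    p <- 1 - pnorm(75, mean = mu, sd = sigma); p              # P(X > 75) = 0.626
    1 - pnorm(1.645 * sqrt(0.25 / (p * (1 - p))) -
              (n * p - n/2) / sqrt(n * p * (1 - p)))          # sign test: 0.48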
Recall that the sign test was based on the one-variable scores

    h(Z_i) = sgn(Z_i),   i = 1, 2, ..., n.

The Wilcoxon signed rank test, which takes into account both the rank and the sign, is
more powerful. It is used to test the hypothesis that the distribution of X is symmetric
about M. Consider the two-variable kernel

    h(Z_i, Z_j) = sgn(Z_i + Z_j),   i ≤ j,

where the (Z_i + Z_j) are the Walsh sums, and define the smooth model on Y = Z_i + Z_j,

    π(y; θ) = exp{θ I(y > 0) − K(θ)} g_Y(y),

where g_Y is the density of Y, assumed to be symmetric around 0 under the null hypothesis
H_0: θ = 0. The composite log likelihood is then proportional to

    ℓ(θ) ∼ θ Σ_{i≤j} I(z_i + z_j > 0) − \binom{n+1}{2} K(θ).
The score statistic is then W_n^+ = Σ_{i≤j} I(Z_i + Z_j > 0). A related statistic is the
Wilcoxon signed rank statistic

    SR^+ = Σ_{i=1}^{n} R_i^+ I(Z_i > 0),

where R_i^+ is the rank of |Z_i| among {|Z_j|, j = 1, ..., n}. The following lemma shows
the equivalence between W_n^+ and SR^+.

Lemma 5.1. W_n^+ = SR^+.

Proof. Suppose without loss of generality that the Z_i's are ordered in absolute value:
|Z_1| < |Z_2| < ... < |Z_n|. Fix j and consider all the pairs (Z_i, Z_j), i ≤ j. The sum
Z_i + Z_j > 0 if and only if Z_j > 0, and the number of indices i for which this is true is
equal to j, the rank of |Z_j|. Hence W_n^+ = Σ_j R_j^+ I(Z_j > 0) = SR^+.

Though itself not a U statistic, W_n^+ is, in fact, the sum of two U statistics:

    W_n^+ = Σ_{i≤j} I(Z_i + Z_j > 0)
          = Σ_i I(Z_i > 0) + Σ_{i<j} I(Z_i + Z_j > 0)               (5.5)
          = n U_{1n} + \binom{n}{2} U_{2n},
with respective kernels I(Z_i > 0) and I(Z_i + Z_j > 0). Under the null hypothesis, we
have that

    E[ Σ_i I(Z_i > 0) ] = n/2,   E[ Σ_{i<j} I(Z_i + Z_j > 0) ] = n(n − 1)/4,

and hence

    E[W_n^+] = n(n + 1)/4.

As well, under the null hypothesis,

    Var(W_n^+) = n(n + 1)(2n + 1)/24.
Example 5.4. A speed-typing course was given to 18 clerks in the hope of training them
to be more efficient typists. Each clerk was tested on typing speed (in w.p.m.) before
and after the course. The results are given in the table below. Using a 1% significance
level, apply the Wilcoxon signed rank test to see whether the course was effective.

    After  53 33 54 61 55 57 40 59 58 53 59 62 51 43 64 68 48 60
    Before 42 35 48 52 60 43 36 63 51 45 56 50 41 38 60 52 48 57
Step 1. We test H_0: M = 0 against H_1: M > 0, where M is the median of the differences
(After − Before).

Step 2. Compute the differences and rank their absolute values.

Step 3. The signs of the original differences are then restored to the ranks, and the sum
of the positive ranks, W_n^+ = 153.5, is the value of the test statistic.

Step 4. Since H_1 is a one-sided hypothesis, a one-tailed test is appropriate. As a bigger
SR^+ value produces stronger support for H_1, one can refer to the table of critical
values (see Table 20 of Lindley and Scott (1995)) with n = 18 and α = 0.01 for decision
making. The critical value is 139. We therefore reject H_0 at the 1% significance level
and conclude that the course was effective.
The Wilcoxon signed rank test can be implemented using the R function wilcox.test. For
example, wilcox.test(x, mu=0, alternative = "greater") produces an exact p-
value of the test, with the alternative being that the median is greater than 0, provided
the sample contains fewer than 50 observations and has no ties. Otherwise, a normal
approximation is used.
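For the data of Example 5.4, a sketch of the computation in R is as follows; note that
wilcox.test drops the single zero difference (48 − 48), so its statistic differs slightly from
the value 153.5 obtained above, which retains the zero when assigning ranks:

    after  <- c(53,33,54,61,55,57,40,59,58,53,59,62,51,43,64,68,48,60)
    before <- c(42,35,48,52,60,43,36,63,51,45,56,50,41,38,60,52,48,57)
    wilcox.test(after, before, paired = TRUE, alternative = "greater")
    # The zero difference is discarded (n = 17) and ties force a normal
    # approximation, so the reported statistic is V = 139.5 rather than 153.5.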
Remark 5.11. One can also apply the sign test in this example. For n = 18 the data
exhibit 14 positive signs, which is insignificant at the 1% level (the critical value being 15).
It is interesting that the sign test uses less of the information in the data than the
Wilcoxon signed rank test, and the strength of evidence supporting H_1 is correspondingly
lower.
One sample serves as the control group and the other as the treatment group. We
would like to test the null hypothesis

    H_0: F = G

against

    H_1: F ≠ G.

It is typically assumed in the shift model that the two distributions differ only by a
location parameter Δ:

    F(x) = G(x − Δ).

That is, X − Δ and Y have the same distribution. If the treatment effect depends on
x, then a more general alternative might be

    H_1: F(x) ≥ G(x) for all x, with strict inequality for some x,

which indicates that small values are more likely under F than under G. In that case,
we say that Y is stochastically larger than X. We consider in the next section a test for
this more general alternative.
Example 5.5. Consider the usual two-sample t-test involving independent samples,
whereby X_1, X_2, ..., X_n are iid N(μ_x, σ²) and Y_1, Y_2, ..., Y_m are iid N(μ_y, σ²). Then
F(x) = Φ((x − μ_x)/σ) and G(y) = Φ((y − μ_y)/σ), where Φ(·) is the cdf of N(0, 1), and it
is easy to see that Δ = μ_x − μ_y.
Example 5.6. Let X_1, ..., X_n be a random sample from a continuous cdf F(x) and let
Y_1, ..., Y_m be an independent random sample from F(x + Δ), where Δ is finite. Let
N = n + m. We would like to test the hypotheses

    H_0: Δ = 0

against

    H_1: Δ > 0.

Suppose that the test statistic to be used is the difference in the sample means,

    D = X̄ − Ȳ.

Let D_obs be the observed value of D. Under the null hypothesis, the combined sample
of N observations is assumed to come from the same cdf F. Ignoring the labels which
identify the distribution from which the observations came, there are \binom{N}{n} possible
samples of n observations assigned to the first population and m observations assigned
to the second. This is precisely the number of possible permutations of the labels, and
each is equally likely. For each of these permutations, we may compute a value of the
statistic D, thereby creating a reference distribution (which also includes the observed
D_obs), and then count how many of these values exceed D_obs in order to calculate
the p-value.
Example 5.7. Fungal infection is believed to have an effect on the eating behavior of
rodents. In an experiment, infected apples were offered to a group of eight randomly
selected rodents, and sterile apples were offered to a group of four. The amounts
consumed (grams of apple per kilogram of body weight) are listed in the table below.
Test whether fungal infection significantly reduces the amounts of apples consumed by
rodents.

    Experimental Group (X_i)  11, 33, 48, 34, 112, 369, 64, 44
    Control Group (Y_i)       177, 80, 141, 132

In this example, no assumption is made about the distribution of the data. Given
that the assignment of rodents to either group is random, if there is no difference (H_0)
between the two groups, all partitions of the 12 scores into two groups of sizes 8 and 4
are equally likely to occur. By permuting the 12 scores, we obtain a total of
\binom{12}{4} = 495 partitions as follows:
    Permuted    Experimental                      Control            Difference
    samples                                                          between means
    1           11 33 48 34 112 64 44 80          369 177 141 132    −151.500
    2           11 33 48 34 64 44 80 132          112 369 177 141    −144.000
    3           11 33 48 34 64 44 80 141          112 369 177 132    −140.625
    4           11 33 48 34 112 64 44 132         369 177 80 141     −132.000
    ...
    135         11 33 112 64 44 177 141 132       48 34 369 80       −43.500
    136*        11 33 48 34 112 369 64 44         177 80 141 132     −43.125
    137         11 34 112 64 44 177 141 132       33 48 369 80       −43.125
    138         11 33 48 112 64 177 141 132       34 369 44 80       −42.000
    ...
    495         48 112 369 64 177 80 141 132      11 33 34 44        109.875

(The row marked * corresponds to the observed data.)
Using the difference between the sample means, i.e., D = X̄ − Ȳ, as our test statistic,
we obtain the following permutation distribution of the difference:
    [Figure: permutation distribution of the difference of means D = X̄ − Ȳ
    (density histogram).]
It can be seen from the figure that under H_0 small differences tend to occur
more frequently, which is reasonable since the difference will tend to fall around 0 if the
two groups are the same.
Remark 5.12. There are other possible choices of the test statistic in the permutation
test:

• Difference of means, D = X̄ − Ȳ.

• Sum of the observations for one group, T_1 = Σ_{i=1}^{n} X_i (or T_2 = Σ_{j=1}^{m} Y_j).
  Since T = T_1 + T_2 is fixed given the observed data, we have
  D = T_1/n − T_2/m = T_1(1/n + 1/m) − T/m, and thus D and T_1 are equivalent test
  statistics.

• Difference of medians, X_0.5 − Y_0.5. The median has the benefit of being robust to
  a few extreme observations (outliers).

Generally, deciding which statistic to use requires some advance knowledge of the pop-
ulation. The difference of the means is most commonly used, especially when the data
come from an approximately normal distribution. But if the population has an asymmet-
ric distribution, the median may be a more desirable indicator of the center of the data.
The difference of trimmed means is used when the distribution is symmetric but likely
to have outliers.
Here we summarize the general steps for a two-sample permutation test, assuming that
a large test statistic T tends to support Δ > 0:

1. Based on the original data, compute the observed test statistic, T_obs (e.g., the
   difference between the two sample means).

2. Permute the n + m observations from the two treatments so that there are n
   observations for population 1 and m observations for population 2. Obtain all
   possible permutations, \binom{n+m}{n} in total, and compute the value of the test
   statistic T for each permutation.
3. Compute the p-value:

    P_upper-tail = #{T's ≥ T_obs} / \binom{m+n}{n},
    P_lower-tail = #{T's ≤ T_obs} / \binom{m+n}{n},
    P_two-tail  = #{|T|'s ≥ |T_obs|} / \binom{m+n}{n}.

4. Declare the test to be significant if the p-value is less than or equal to the desired
   significance level.
Remark 5.13. The permutation test can become rather tedious as the sample sizes m and
n increase. For instance, \binom{20}{10} = 184,756 is already quite large. Fortunately,
there is a simple way to obtain an approximate p-value in such cases. Rather than using
all the possible permutations, we take a random sample of, say, 1,000 out of the
\binom{m+n}{m} permutations and find the approximate p-value using the distribution
formed by the 1,000 statistics, in the same manner as for the exact permutation test.
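The following R sketch applies this Monte Carlo version of the permutation test to the
data of Example 5.7; the number of sampled permutations is arbitrary:

    set.seed(16)
    x <- c(11, 33, 48, 34, 112, 369, 64, 44)   # experimental group
    y <- c(177, 80, 141, 132)                  # control group
    z <- c(x, y); n <- length(x)
    d_obs <- mean(x) - mean(y)
    d_perm <- replicate(10000, {
      idx <- sample(length(z), n)              # random relabeling of the 12 scores
      mean(z[idx]) - mean(z[-idx])
    })
    mean(d_perm <= d_obs)   # approximate lower-tail p-value for reduced intake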
Consider now testing

    H_0: μ_Y = μ_X vs H_1: μ_Y > μ_X.                              (5.6)

Define the kernel h(X, Y) = I(X < Y) and the smooth model

    π(h(x, y); θ) = exp{θ h(x, y) − K(θ)} g(h(x, y)),              (5.7)

where g(h(x, y)) = 1/2, x ≠ y, is the null density of h(X, Y) and K(θ) satisfies

    e^{K(θ)} = (e^θ + 1)/2.

It follows that K(0) = 0. When θ = 0, the model specified in (5.7) indicates that X
and Y are independent and identically distributed with distribution F. Consequently,
the hypotheses in (5.6) can be expressed in terms of θ as

    H_0: θ = 0 vs H_1: θ > 0.
For random samples of sizes n and m from F and G, respectively, the log of the
composite likelihood function becomes proportional to

    l(θ; X, Y) ∼ θ Σ_{j=1}^{m} Σ_{i=1}^{n} I(X_i < Y_j) − nm K(θ).

The score test statistic evaluated under H_0 is then given by the U statistic

    U(X, Y) = Σ_{j=1}^{m} Σ_{i=1}^{n} I(X_i < Y_j) = #{(X_i, Y_j): X_i < Y_j},   (5.8)

which rejects H_0 for large values. This is called the Mann-Whitney statistic, and it is
the counting form of the statistic. It is this version which is most often used for
theoretical developments, as will be seen in Chapter 8 when we consider notions of
efficiency.
Lemma 5.2. Let R(Y_j) denote the rank of Y_j in the combined sample of N = n + m
observations and let S(Y_j) denote its rank among the Y's alone. Then

    R(Y_j) = S(Y_j) + Σ_{i=1}^{n} I(Y_j > X_i).

Proof. Writing Z_1, ..., Z_{n+m} for the combined sample,

    R(Y_j) = Σ_{i=1}^{n+m} I(Y_j > Z_i) + 1
           = Σ_{i=1}^{n+m} I(Y_j > Z_i) [ I(Z_i = Y_i) + I(Z_i = X_i) ] + 1
           = Σ_{i=1}^{m} I(Y_j > Y_i) + 1 + Σ_{i=1}^{n} I(Y_j > X_i)
           = S(Y_j) + Σ_{i=1}^{n} I(Y_j > X_i).
It can be seen from Lemma 5.2 that the score statistic is equal to

    U(X, Y) = Σ_{j=1}^{m} R(Y_j) − Σ_{j=1}^{m} S(Y_j)
            = W − m(m + 1)/2,                                      (5.9)

where the sum

    W = Σ_{j=1}^{m} R(Y_j)                                          (5.10)

is the Wilcoxon rank-sum statistic. The moments and asymptotic distribution of U(X, Y)
are summarized in the following theorem. Define

    q_1 = P(X_1 < Y_1),
    q_2 = P(X_1 < Y_1, X_1 < Y_2),
    q_3 = P(X_1 < Y_1, X_2 < Y_1).

Theorem 5.1. (a) E[U(X, Y)] = mn q_1.

(b) Var(U(X, Y)) = mn [ q_1(1 − q_1) + (m − 1)(q_2 − q_1²) + (n − 1)(q_3 − q_1²) ].

(c) Under the null hypothesis whereby F(x) = G(x) for all x,

    q_1 = 1/2,   q_2 = q_3 = 1/3,

and

    ( U(X, Y) − mn/2 ) / sqrt( mn(m + n + 1)/12 ) →_L N(0, 1).
Proof. The proof is found in (Lehmann (1975), p. 335 and p. 364) and is a direct appli-
cation of Example 3.5.
Remark 5.14. The Mann-Whitney test is also based on the permutation test with U
as the test statistic, and its critical value can also be tabulated accordingly. See, for
example, Table 21 of Lindley and Scott (1995). To make inferences, we may either
compute the p-value with the permutation method or compare the observed Mann-
Whitney test statistic Uobs with the corresponding critical value in the table.
The Mann-Whitney test can be implemented using the R function wilcox.test. For wilcox.test(x, y, paired=FALSE), the test statistic is defined as "the number of pairs $(X_i, Y_j)$ for which $Y_j \le X_i$." Therefore, our Mann-Whitney $U$ statistic can be obtained using wilcox.test(y, x, paired=FALSE). By default, an exact p-value is computed if the samples contain fewer than 50 finite values and there are no ties; otherwise, a normal approximation is used.
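A minimal usage sketch with hypothetical data (no ties and small samples, so the reported p-value is exact):

x <- c(8.5, 9.1, 7.6, 10.2, 6.8)     # hypothetical first sample
y <- c(11.4, 9.8, 12.1, 10.7)        # hypothetical second sample
wilcox.test(y, x, paired = FALSE)    # reports our U(X, Y) as "W"
wilcox.test(y, x, conf.int = TRUE)   # also inverts the test for an interval on the shift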
Remark 5.15. Under $H_0$, the $m$ ranks associated with the $Y$-sample should be randomly selected from the finite population of $N = m + n$ ranks. From the theory of sampling from a finite population, the expected value and variance of $W$ under $H_0$ are given by
$$E(W) = m\mu, \qquad Var(W) = \frac{mn\sigma^2}{m+n-1},$$
where
$$\mu = \frac{\sum_{i=1}^{N} i}{N} = \frac{N+1}{2},$$
$$\sigma^2 = \frac{\sum_{i=1}^{N}(i-\mu)^2}{N} = \frac{1^2 + 2^2 + \cdots + N^2}{N} - \left(\frac{N+1}{2}\right)^2 = \frac{(N-1)(N+1)}{12}.$$
Hence, under $H_0$,
$$E(W) = \frac{m(m+n+1)}{2}, \qquad Var(W) = \frac{mn(m+n+1)}{12}.$$
Note that from (5.9), $U(X, Y) = W - \frac{m(m+1)}{2}$. It can be shown that under $H_0$,
$$E(U(X, Y)) = E(W) - \frac{m(m+1)}{2} = \frac{mn}{2}$$
and
$$Var(U(X, Y)) = Var(W) = \frac{mn(m+n+1)}{12}.$$
Remark 5.16 (Wilcoxon Rank-Sum/Mann-Whitney Test Adjusted for Ties). The case of ties can be dealt with by counting each tied pair as $\frac{1}{2}$ in the Mann-Whitney statistic $U(X, Y)$ or by assigning the average rank (also called the mid-rank) to the tied observations in the Wilcoxon rank-sum statistic $W$. The mean of $U(X, Y)$ or $W$ does not change, but the variance of $U(X, Y)$ or $W$ under $H_0$ should be adjusted downwards:
$$Var(U(X, Y)) = Var(W) = \frac{mn(m+n+1)}{12} - \frac{mn\sum_{i=1}^{k}(d_i^3 - d_i)}{12(m+n)(m+n-1)}, \qquad (5.11)$$
where $k$ is the number of tied groups and $d_i$ is the number of tied observations in the $i$th tied group, $i = 1, 2, \ldots, k$. See Lehmann (1975), p. 355.
Example 5.9. A statistics course has two tutorial classes conducted by two tutors,
Maya and Alyssa. Students of this course were given a mid-term test and some students
were randomly drawn from each of two tutorial classes and their test scores are shown
below:
Maya’s class (X) 82 74 87 86 75
Alyssa’s class (Y ) 88 77 91 88 94 93 83 94
We wish to test the null hypothesis that there is no difference in statistics ability, as measured by this test, between the two classes.
Solution. Here $n = 5$ and $m = 8$.
Combined Data  74  75  77  82  83  86  87  88   88   91  93  94    94
Ranks           1   2   3   4   5   6   7  8.5  8.5  10  11  12.5  12.5
Data from X  82  74  87  86  75
Ranks         4   1   7   6   2
Data from Y  88   77  91  88   94    93  83  94
Ranks        8.5   3  10  8.5  12.5  11   5  12.5
The population variance of the adjusted ranks is
$$\sigma^2 = \frac{1^2 + 2^2 + \cdots + 12.5^2 + 12.5^2}{13} - 7^2 = 13.92.$$
The Wilcoxon rank-sum test statistic with adjusted ranks has, under $H_0$,
$$E(W) = m\mu = 8(7) = 56, \qquad Var(W) = \frac{mn\sigma^2}{m+n-1} = \frac{8(5)(13.92)}{12} = 46.41.$$
Although one may apply the sampling formulas directly to the adjusted ranks for tied data, we can also use the explicit formula (5.11) for $Var(W)$ under $H_0$. Note that there are 2 tied groups, $\{8.5, 8.5\}$ and $\{12.5, 12.5\}$, and hence $k = 2$, $d_1 = d_2 = 2$. Therefore,
$$Var(W) = \frac{mn(m+n+1)}{12} - \frac{mn\sum_{i=1}^{k}(d_i^3 - d_i)}{12(m+n)(m+n-1)} = \frac{8(5)(8+5+1)}{12} - \frac{8(5)\cdot 2(2^3 - 2)}{12(13)(12)} = 46.67 - 0.26 = 46.41.$$
The observed value of $W$ is $8.5 + 3 + 10 + 8.5 + 12.5 + 11 + 5 + 12.5 = 71$, so that, with a continuity correction, the two-sided p-value is
$$2P(W \ge 71) = 2P\left(Z > \frac{70.5 - 56}{\sqrt{46.41}}\right) = 0.0333, \qquad Z \sim N(0, 1).$$
Since the p-value is less than 0.05, we reject the null hypothesis.
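The calculation may be checked numerically in R; the quantities below reproduce the values obtained above (the tie counts $d_1 = d_2 = 2$ are entered directly):

x <- c(82, 74, 87, 86, 75)                  # Maya's class
y <- c(88, 77, 91, 88, 94, 93, 83, 94)      # Alyssa's class
r <- rank(c(x, y))                          # mid-ranks for tied observations
W <- sum(r[6:13])                           # rank sum of the Y-sample: 71
m <- 8; n <- 5; N <- m + n
V <- m * n * (N + 1) / 12 -
     m * n * sum(c(2, 2)^3 - c(2, 2)) / (12 * N * (N - 1))   # 46.41 by (5.11)
2 * pnorm((W - 0.5 - m * (N + 1) / 2) / sqrt(V), lower.tail = FALSE)  # ~ 0.0333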
The inequality holds if and only if at least $a$ and at most $b - 1$ pairs $(X_i, Y_j)$ satisfy $X_i - Y_j < \Delta$ (or $X_i < Y_j + \Delta$). Since $X_i$ and $Y_j' = Y_j + \Delta$ have the same distribution, we have
$$\Pr\left(D_{(a)} < \Delta \le D_{(b)}\right) = \Pr(a \le U \le b - 1) = 1 - \alpha.$$
The values of $a$ and $b$ can be obtained from the null distribution of $U(X, Y)$:
$$a = l_{\alpha/2} + 1, \qquad b = u_{\alpha/2},$$
where $l_{\alpha/2}$ and $u_{\alpha/2}$ are the lower-$\alpha/2$ and upper-$\alpha/2$ percentile points of the $U(X, Y)$ distribution.
Remark 5.17. To find a $100(1-\alpha)\%$ confidence interval for $\Delta$ for large samples, we can use a normal approximation with continuity correction. Let $Z \sim N(0, 1)$. Then
$$\Pr(a \le U(X, Y) \le b - 1) = P\left(\frac{a - 0.5 - E(U(X, Y))}{\sqrt{Var(U(X, Y))}} < Z < \frac{b - 0.5 - E(U(X, Y))}{\sqrt{Var(U(X, Y))}}\right) = 1 - \alpha.$$
For $\alpha = 0.05$, $\Pr(-1.96 < Z < 1.96) = 0.95$ and we can obtain $a$ and $b$:
$$a = 0.5 + E(U) - 1.96\sqrt{Var(U)}, \qquad b = 0.5 + E(U) + 1.96\sqrt{Var(U)}.$$
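A sketch of this interval in R under the stated normal approximation (the function name and the rounding of $a$ and $b$ to integers are our own choices):

# Large-sample confidence interval for the shift Delta based on the
# ordered differences D_(1) <= ... <= D_(mn) of X_i - Y_j
shift.ci <- function(x, y, alpha = 0.05) {
  m <- length(x); n <- length(y)
  D  <- sort(outer(x, y, "-"))               # all mn differences X_i - Y_j
  EU <- m * n / 2
  VU <- m * n * (m + n + 1) / 12
  z  <- qnorm(1 - alpha / 2)
  a  <- ceiling(0.5 + EU - z * sqrt(VU))
  b  <- floor(0.5 + EU + z * sqrt(VU))
  c(lower = D[a], upper = D[b])              # Pr(D_(a) < Delta <= D_(b)) ~ 1 - alpha
}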
Suppose two machines for bottling Coca-Cola are designed to fill the cans with 330 ml of the soft drink. It is expected that the observed data on the amount of coke in each can from the two machines are centered around 330 ml, as they should be, but their variability may not be the same (see Figure 5.1).
[Figure 5.1.: Two densities centered at 330 ml, one with less variability and one with more variability.]
Formally, let
$$X_i = \mu + \sigma_1\varepsilon_{i1}, \quad i = 1, \ldots, m, \qquad Y_j = \mu + \sigma_2\varepsilon_{j2}, \quad j = 1, \ldots, n,$$
where all the $\varepsilon$'s are i.i.d. random variables with a median of 0. Note that both the $X$'s and the $Y$'s share the same location parameter $\mu$. Here, we want to test $H_0 : \sigma_1 = \sigma_2$. A nonparametric test that makes use of the Wilcoxon rank-sum test is the Siegel-Tukey test (Siegel and Tukey, 1960). The steps for carrying out the Siegel-Tukey test are as follows:
1. Arrange the combined data from smallest to largest.
2. Assign rank 1 to the smallest observation, rank 2 to the largest observation, rank 3 to the next largest observation, rank 4 to the next smallest observation, rank 5 to the next smallest observation, and so on. The Wilcoxon rank-sum test is then applied to these ranks; a sketch of the rank assignment is given below.
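The helper below is our own sketch of the alternating rank assignment, not from the text:

# Siegel-Tukey ranks: rank 1 from the bottom, ranks 2-3 from the top,
# ranks 4-5 from the bottom, and so on
siegel.tukey.ranks <- function(z) {
  N   <- length(z)
  ord <- order(z)                      # indices of the sorted observations
  st  <- numeric(N)
  lo <- 1; hi <- N; r <- 1
  take.low <- TRUE; first <- TRUE
  while (lo <= hi) {
    k <- min(if (first) 1 else 2, hi - lo + 1)  # rank 1 alone, then pairs
    for (j in seq_len(k)) {
      if (take.low) { st[ord[lo]] <- r; lo <- lo + 1 }
      else          { st[ord[hi]] <- r; hi <- hi - 1 }
      r <- r + 1
    }
    take.low <- !take.low; first <- FALSE
  }
  st
}
# The Wilcoxon rank-sum test is then applied to these ranks in place of the usual ones.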
Example 5.12. Consider the following two sets of data and calculate the Siegel-Tukey
test statistic:
Chapter Notes
1. The power efficiency of the sign test relative to the Student's t test for the case of the normal distribution is 95% for small samples. The relative efficiency decreases as the sample size increases, falling to about 75%. It also decreases with increasing level of significance and with increasingly distant alternatives. See Wayne Daniel (1990, p. 33).
2. The asymptotic efficiency of the Wilcoxon signed rank test has been investigated by Noether (see p. 43 of Daniel (1990)). The Wilcoxon signed rank test has an asymptotic relative efficiency of 0.955 relative to the one-sample t test if the differences are normally distributed and an efficiency of 1 if the differences are uniformly distributed. The asymptotic efficiency of the sign test relative to the Wilcoxon signed rank test is 2/3 if the differences are normally distributed and 4/3 if the differences follow the double exponential distribution.
3. Tables for the Mann-Whitney test are found in Daniel (1990), p. 508.
5.4. Exercises
Exercise 5.1. Refer to the data in Example 5.1. Last year, the average weekly sales was 100 units. Is there sufficient evidence to conclude that sales this year exceed last year's sales? Test at $\alpha = 0.05$. The usual type of test to run for such a question is a one-sample t-test. For the given problem, the correct hypotheses are $H_0 : \mu = 100$ versus $H_1 : \mu > 100$, where $\mu$ is the true mean weekly sales of new mobile phones at the mobile phone shop this year.
(a) What is the result of the t-test? (Ans: t = 0.4048, d.f. = 11)
(b) What assumptions underlie the above t-test?
(c) Are these assumptions valid? You may draw the histogram of the sales.
(d) Under what conditions do we feel safe to use the above t-test?
Exercise 5.2. Fifteen patients are trying a new diet. The differences between weights
before and after the diet are (in the order from the smallest)
-7.8 -6.9 -4.0 3.7 6.5 8.7 9.1 10.1 10.8 13.6 14.4 15.6 20.2 22.4 23.5
(a) If the diet has no effect, what should the median weight loss be?
(b) Is it a one- or two-sided sign test? Hence, conduct the test at the 5% significance
level.
Exercise 5.3. Denote by $F(x)$ the cumulative distribution function (cdf) of a continuous random variable $X$. Show that for a given $x$, an approximate $100(1-\alpha)\%$ confidence interval for $F(x)$ is given by
$$\hat{F}(x) \pm z_{\alpha/2}\sqrt{\frac{\hat{F}(x)[1-\hat{F}(x)]}{n}},$$
where $\hat{F}(x)$ is the empirical cdf.
Exercise 5.4. Show that under the null hypothesis, the Wilcoxon signed rank statistic is equivalent in distribution to the sum
$$W^+ = \sum_{j} V_j.$$
Exercise 5.5.
(a) Two different groups of individuals were compared with respect to how quickly they
respond to a changing traffic light. Test using the two-sided Wilcoxon statistic the
hypothesis that there is no difference between the two groups on the basis of the
following data
Group 1 19.0 14.4 18.2 15.6 14.5 11.2 13.9 11.6
Group 2 12.1 19.1 11.6 21.0 16.7 10.1 18.3 20.5
(b) Do the data suggest that the population variances differ? Carry out a Siegel-Tukey
test at the 5% significance level.
(c) Suppose that those same groups are compared at a later time. Indicate how you would combine the two data sets using the smooth embedding approach in order to define a single test. (Hint: suppose that in the embedding approach, the parameters for the individual groups are $\theta_1, \theta_2$, respectively. Define $\gamma_1 = \frac{\theta_1+\theta_2}{2}$, $\gamma_2 = \frac{\theta_1-\theta_2}{2}$. Then
$$\theta_1 = \gamma_1 + \gamma_2, \qquad \theta_2 = \gamma_1 - \gamma_2,$$
and we may phrase the testing problem in terms of $\gamma_1, \gamma_2$.)
6. Multi-Sample Problems
In this chapter, we present a unified theory of hypothesis testing based on ranks. The theory consists of defining two sets of rankings, one consistent with the alternative and the other consistent with the data itself, in a sense to be described. The test statistic is then constructed by measuring the distance between the two sets. Critchlow (1986, 1992) utilized a different definition for measuring the distance between sets. The problem can be embedded into a smooth parametric alternative framework which then leads to a test statistic. It is seen that locally most powerful tests can be obtained from this construction. We illustrate the approach in the cases of testing for ordered as well as unordered multi-sample location problems. In addition, we also consider dispersion alternatives. The tests are derived in the case of the Spearman and the Hamming distance functions. The latter were chosen to exemplify that different approaches may be needed to obtain the asymptotic distributions under the hypotheses.
• Step 2: Define {μ} to be the set of all rankings which are equivalent to the
observed ranking μ in the sense that ranks occupied by identically distributed
random variables are exchangeable.
• Step 3: Define E to be the class of extremal rankings which are most in agreement
with the alternative. The set E is not data dependent.
Small values of d ({μ} , E) are consistent with the alternative and lead to the
rejection of the null hypothesis.
We may now integrate the unified theory in the context of the smooth alternative model. For each fixed $\mu \in \{\mu\}$, we may define the smooth alternative density as proportional to
$$\pi(\mu; E, \theta) \sim \exp\left[-\theta\, d(\mu, E) - K(\theta)\right], \qquad (6.1)$$
where
$$d(\mu, E) = \sum_{\nu \in E} d(\mu, \nu)$$
and $K(\theta)$ is a normalizing constant. Let $n_\mu$ be the cardinality of $\{\mu\}$. Under the null hypothesis, $H_0 : \theta = 0$, all the rankings $\mu \in \{\mu\}$ are equally likely. Under the alternative, $H_1 : \theta > 0$, rankings $\mu$ which are closer to $E$ are more likely than those which are further away. The distance model in (6.1) generalizes the distance-based models of Diaconis and Mallows described in Alvo and Yu (2014), as discussed earlier in Section 4.4.
The logarithm of the composite likelihood function constructed from (6.1) is then proportional to
$$\sum_{\mu_i \in \{\mu\}}\left[-\theta\, d(\mu_i, E) - K(\theta)\right] = -\theta\sum_{\mu_i \in \{\mu\}} d(\mu_i, E) - n_\mu K(\theta).$$
$$N_k = n_1 + \ldots + n_k, \quad k = 1, \ldots, r.$$
Rank all the $N_r$ observations among themselves and write the permutation of observed ranks
$$\mu = \left[\mu(1), \ldots, \mu(N_1) \mid \ldots \mid \mu(N_{r-1}+1), \ldots, \mu(N_r)\right],$$
where ranks from the same distribution are placed together. Hence, $\mu(1), \ldots, \mu(N_1)$ represent the observed ranks of the $n_1$ observations from $F_1(x)$ in the combined sample. The equivalent subclass $\{\mu\}$ consists of all permutations of the integers $1, \ldots, N_r$ which assign the same set of ranks to the individual populations as $\mu$ does. On the other hand, the extremal set $E$ consists of all permutations which assign ranks $N_{k-1}+1, \ldots, N_k$ to population $k$. Alvo and Pan (1997) derived the test statistics corresponding to the Spearman, the Spearman footrule, and the Hamming distances. They also obtained the asymptotic distributions under both the null and alternative hypotheses.
In order to illustrate the methodology, we consider the two-sample case. Suppose that we observe independent random variables $X_1, X_2, X_3, X_4$ with $X_1, X_2$ from $F_1$ and $X_3, X_4$ from $F_2$. The alternative hypothesis claims that $F_2 \le F_1$. Among all rankings of the integers $(1, 2, 3, 4)$, one would expect that the small ranks would be most consistent with $F_1$ and larger ranks would be most consistent with $F_2$. Adopting the convention that the first two components of a ranking refer to population 1 and the next two to population 2, the exchangeable rankings compatible with the alternative hypothesis would be
$$(1, 2 \mid 3, 4), \quad (2, 1 \mid 3, 4), \quad (1, 2 \mid 4, 3), \quad (2, 1 \mid 4, 3).$$
Returning now to the general situation, the Spearman and Hamming test statistics
are as follows.
Spearman: In the case of the Spearman distance, the test statistic in the multi-sample case was shown to be
$$S = \sum_{i=1}^{N_r} c(i)\,\frac{\mu(i)}{N_r + 1},$$
where for $1 \le k \le r$, $c(i) = N_{k-1} + N_k$ whenever $N_{k-1} < i \le N_k$.
We note that
$$\bar{c} = \frac{\sum_{k=1}^{r}\sum_{i=N_{k-1}+1}^{N_k}(N_{k-1}+N_k)}{N_r} = \frac{\sum_{k=1}^{r} n_k(N_{k-1}+N_k)}{N_r} \qquad (6.3)$$
$$= \frac{\sum_{k=1}^{r}(N_k - N_{k-1})(N_{k-1}+N_k)}{N_r} = \frac{\sum_{k=1}^{r}\left(N_k^2 - N_{k-1}^2\right)}{N_r} = N_r. \qquad (6.4)$$
We recognize that $S$ is a simple linear rank statistic and consequently, under the null hypothesis, $S$ is asymptotically normal with mean
$$\frac{N_r\bar{c}}{2}$$
and variance
$$\sigma^2 = \frac{N_r^3}{12}\sum_{k=1}^{r} w_k W_k W_{k-1},$$
where
$$\frac{n_k}{N_r} \to w_k, \qquad W_k \equiv \sum_{i=1}^{k} w_i.$$
The test rejects for large values of $S$. In the two-sample case, the statistic becomes:
Example 6.1. The statistic in the two-sample ordered location problem based on the Spearman distance is
$$S = \frac{n_1}{n+1}\sum_{i=1}^{n_1}\mu(i) + \frac{n_1+n}{n+1}\sum_{i=n_1+1}^{n_1+n_2}\mu(i) = \frac{n_1 n}{2} + \frac{n}{n+1}W,$$
where $W$ is the Wilcoxon statistic, for which the exact mean and variance from Section 5.3.3 are
$$E[W] = \frac{n_2(n+1)}{2}, \qquad Var[W] = \frac{n_1 n_2(n+1)}{12}.$$
Hamming: In the case of the Hamming distance, the test statistic in the multi-sample case was shown to be
$$H = \sum_{i=1}^{N_r} a_{i\mu(i)},$$
where
$$a_{ij} = \begin{cases}\frac{1}{n_k} & i, j \in \{N_{k-1}+1, \ldots, N_k\},\ 1 \le k \le r,\\[0.5ex] 0 & \text{otherwise.}\end{cases}$$
Equivalently, we may express Hamming's statistic as
$$H = \frac{Y_1}{n_1} + \ldots + \frac{Y_r}{n_r},$$
where $Y_k$ is the number of observed rankings in the set $\{N_{k-1}+1, \ldots, N_k\}$. The test rejects for large values of $H$. We may apply Hoeffding's theorem (Section 3.3) to obtain a central limit theorem for the Hamming statistic under the null hypothesis. We have that $H$ is asymptotically normal with mean
$$\frac{1}{N_r}\sum_{i=1}^{N_r}\sum_{j=1}^{N_r} a_{ij} = 1$$
and variance
$$\frac{1}{N_r-1}\sum_{i=1}^{N_r}\sum_{j=1}^{N_r} d_{ij}^2 = \frac{r-1}{N_r-1}, \qquad (6.5)$$
where
$$d_{ij} = a_{ij} - \bar{a}_{i.} - \bar{a}_{.j} + \bar{a}_{..} = a_{ij} - \frac{1}{N_r} - \frac{1}{N_r} + \frac{1}{N_r} = a_{ij} - \frac{1}{N_r}.$$
The condition of Hoeffding's theorem is satisfied since
$$\frac{\max_{1\le i,j\le N_r} d_{ij}^2}{\frac{1}{N_r}\sum_{i=1}^{N_r}\sum_{j=1}^{N_r} d_{ij}^2} \le \frac{\max_{1\le k\le r} n_k^{-2}}{\frac{r-1}{N_r}} \to 0.$$
Example 6.2. The statistic in the two-sample ordered location problem based on Hamming's distance becomes
$$H = \sum_{i=1}^{n_1+n_2} a_{i\mu(i)},$$
where
$$a_{ij} = \begin{cases}\frac{1}{n_1} & i, j \in \{1, \ldots, n_1\},\\[0.5ex] \frac{1}{n_2} & i, j \in \{n_1+1, \ldots, n_1+n_2\},\\[0.5ex] 0 & \text{otherwise.}\end{cases}$$
It follows from Hoeffding's combinatorial central limit theorem (see Theorem 3.6) that $H$ is asymptotically normal with mean equal to 1 and variance
$$\frac{1}{n_1+n_2-1}$$
as $\min(n_1, n_2) \to \infty$.
$$H_1 : \bigcup_{h=1}^{r!} H_{1h}.$$
Let $T_h$ be a test statistic for testing $H_0$ against $H_{1h}$ with rejection region $\{T_h < c\}$. We define the test statistic $T_M$ for testing $H_0$ against $H_1$ using the following three steps:
1. Linearization: Let $\tilde{\alpha} = (\alpha_1, \ldots, \alpha_{r!})'$ and put
$$T_L(\tilde{\alpha}) = \sum_{h=1}^{r!}\alpha_h T_h.$$
2. Normalization: Put
$$T_N(\tilde{\alpha}) = \frac{T_L(\tilde{\alpha}) - E_0[T_L(\tilde{\alpha})]}{\sqrt{Var_0[T_L(\tilde{\alpha})]}}.$$
3. Minimization: Put
$$T_M = \min_{\tilde{\alpha}} T_N(\tilde{\alpha}).$$
The use of this approach in the case of the Spearman distance leads to the well-known
Kruskal-Wallis test statistic for the one-way analysis of variance. Details of the proof of
the next two theorems may be found in Alvo and Pan (1997).
Theorem 6.1 (Spearman Case). The test statistic in the unordered case based on the Spearman distance rejects the null hypothesis for large values of $T_M = \bar{\mu}'\Sigma_S^{-1}\bar{\mu}$, where $\bar{\mu} = (\bar{\mu}_1, \ldots, \bar{\mu}_r)'$ with
$$\bar{\mu}_k = \sum_{i=N_{k-1}+1}^{N_k}\mu(i)$$
and inverse
$$\Sigma_S^{-1} = \frac{12}{N_r(N_r+1)}\left[\mathrm{diag}\left(\frac{1}{n_1}, \ldots, \frac{1}{n_r}\right) + J/N_r\right].$$
Moreover, under the null hypothesis, as $\min\{n_1, \ldots, n_r\} \to \infty$,
$$T_M \xrightarrow{L} \chi^2_{r-1}.$$
Equivalently, $T_M$ may be expressed as
$$\frac{12}{N_r(N_r+1)}\sum_{k=1}^{r} n_k\left(\frac{\bar{\mu}_k}{n_k} - \frac{N_r+1}{2}\right)^2.$$
Proof. The proof is a direct consequence of the multivariate central limit theorem.
Theorem 6.2 (Hamming Case). The test statistic in the unordered case based on the Hamming distance rejects the null hypothesis for large values of $T_M = \bar{\mu}'\Sigma_H^{-1}\bar{\mu}$, where $\bar{\mu} = (\bar{\mu}_{kp})$ with
$$\bar{\mu}_{kp} = \sum_{j=N_{k-1}+1}^{N_k} a_{p\mu(j)}$$
and
$$a_{ij} = \begin{cases}1 & j \in \{N_{i-1}+1, \ldots, N_i\},\\ 0 & \text{otherwise,}\end{cases}$$
with expectation
$$E_0[\bar{\mu}_{kp}] = \frac{n_k n_p}{N_r}$$
and covariance
$$\Sigma_H = Cov(\bar{\mu}_{kp}, \bar{\mu}_{k'p'}) = \begin{cases}\dfrac{n_k n_p(N_r - n_k)(N_r - n_p)}{N_r^2(N_r-1)} & k = k',\ p = p',\\[1ex] -\dfrac{n_k n_p(N_r - n_k)n_{p'}}{N_r^2(N_r-1)} & k = k',\ p \neq p',\\[1ex] -\dfrac{n_k n_p(N_r - n_p)n_{k'}}{N_r^2(N_r-1)} & k \neq k',\ p = p',\\[1ex] \dfrac{n_k n_p n_{k'} n_{p'}}{N_r^2(N_r-1)} & k \neq k',\ p \neq p'.\end{cases}$$
Proof. The proof is a direct consequence of the multivariate central limit theorem.
Consider the data on intelligence scores in Appendix A. Wechsler Adult Intelligence Scale scores were recorded on 12 males listed by age groups. A look at the data reveals that the peak is located in the 35-54 age group. In general, we would like to test the null hypothesis that there is no difference due to age against the alternative that the scores rise monotonically prior to the peak and decrease thereafter. Two situations arise: when the location of the peak is known and when it is unknown.
Let $F_1(x), \ldots, F_r(x)$ be $r$ continuous distributions and suppose we wish to test the null hypothesis of homogeneity against an umbrella alternative, with strict inequality for some $x$. Suppose that there are $m_i$ observations from $F_i$, $i = 1, \ldots, r$, and that $n = m_1 + \ldots + m_r$.
In the case when the location of the peak is known, Alvo (2008) obtained the test statistics corresponding to both the Spearman and Kendall distance functions and showed that they were asymptotically equivalent under the condition that
$$\min_i(m_i) \to \infty,$$
where $\bar{\mu}_i$ represents the average of the ranks in the $i$th population. Under the null hypothesis, $S_p$ is asymptotically normal with mean 0 and variance
$$\sigma_p^2 = \frac{m\left[n(n+1)\right]^2}{12}\left[\sum_{i=1}^{p}\left(\frac{i}{p} - \frac{r+1}{2r}\right)^2 + \sum_{i=p+1}^{r}\left(\frac{r+1-i}{r+1-p} - \frac{r+1}{2r}\right)^2\right].$$
Alvo (2008) also developed a test for the case when the location of the peak is unknown. This is based on the statistic
$$S_{\max} = \max_p \frac{S_p}{\sigma_p}.$$
$$F_Y(x) = F_X\left(\mu + (x-\mu)/\gamma\right),$$
$$H_0 : \gamma \ge 1 \quad \text{vs} \quad H_1 : \gamma < 1.$$
The alternative states that the second population is less spread out than the first. We shall assume for simplicity of the presentation that both sample sizes are even numbers with
$$n_i = 2m_i, \quad i = 1, 2.$$
Applying the unified theory to the dispersion problem, rank all the observations together and let items $1, \ldots, n_1$ be from the first population and items $n_1+1, \ldots, n_1+n_2$ be from the second. The equivalence class $\{\mu\}$ consists of all permutations obtained by permuting the labels assigned to the items in each population respectively. Moreover, in view of the assumption that the medians are the same, we also transpose the items ranked in positions $i$ and $n+1-i$. The extremal class $E$ consists of all permutations which rank the items from the second population in the middle and the items from the first population at the two ends. This is because the first population is more diverse. Using the Spearman distance, it can be shown that the test statistic takes the form
$$S = \sum_{i=1}^{n_1}\left|R_i - \frac{n+1}{2}\right|,$$
which is precisely the Freund-Ansari-Bradley test (Hájek and Sidak (1967), p. 95).
Suppose now that we observe a random sample from each of the $r$ populations: $\{X_{ij}, j = 1, \ldots, n_i\}$, $i = 1, \ldots, r$. The composite log likelihood function for the $i$th population is proportional to
$$\theta_i\sum_{l\neq i}\sum_{j=1}^{n_i}\sum_{j'=1}^{n_l}\mathrm{sgn}(x_{ij} - x_{lj'}) - \sum_{l\neq i} n_i n_l K(\theta_i),$$
and hence the composite log likelihood function taking into account all the populations is proportional to
$$l(\theta) \sim \sum_{i=1}^{r}\theta_i\sum_{l\neq i}\sum_{j=1}^{n_i}\sum_{j'=1}^{n_l}\mathrm{sgn}(x_{ij} - x_{lj'}) - \sum_{i=1}^{r}\sum_{l\neq i} n_i n_l K(\theta_i).$$
$$H_0 : \theta_i = \theta, \quad i = 1, \ldots, r \qquad (6.6)$$
$$H_1 : \theta_i \neq \theta_j, \quad \text{for some } i \neq j. \qquad (6.7)$$
Under the null hypothesis, setting $\theta_k = \theta$ for all $k$, the antisymmetry of the sign function yields
$$\sum_{i=1}^{r}\sum_{l\neq i}\theta\,\mathrm{sgn}(x_i - x_l) = \theta\sum_{i=1}^{r}\sum_{l\neq i}\mathrm{sgn}(x_i - x_l) = 0,$$
which shows that the model can actually be specified by using only $(r-1)$ parameters. Consequently, we may redefine the parameters so that under the null hypothesis $\theta_i = 0$, and we wish to test $H_0 : \theta_i = 0$, $i = 1, \ldots, r$. It follows that $K(0) = 0$.
Suppose now that we observe a random sample from each of the $r$ populations: $X_{ij}$, $j = 1, \ldots, n_i$, $i = 1, \ldots, r$. The composite log likelihood function is proportional to
$$l(\theta) \sim \sum_{i=1}^{r}\theta_i\sum_{l\neq i}\sum_{j=1}^{n_i}\sum_{j'=1}^{n_l}\mathrm{sgn}(x_{ij} - x_{lj'}) - \sum_{i=1}^{r}\sum_{l\neq i} n_i n_l K(\theta).$$
Since
$$R_{ij} - \frac{n+1}{2} = \frac{1}{2}\sum_{l\neq i}^{r}\sum_{j'=1}^{n_l}\mathrm{sgn}(x_{ij} - x_{lj'}),$$
we have
$$l(\theta) \sim 2\sum_{i=1}^{r}\theta_i\sum_{j=1}^{n_i}\left(R_{ij} - \frac{n+1}{2}\right) - \sum_{i=1}^{r}\sum_{l\neq i} n_i n_l K(\theta) = 2\sum_{i=1}^{r} n_i\theta_i\left(\bar{R}_i - \frac{n+1}{2}\right) - \sum_{i=1}^{r}\sum_{l\neq i} n_i n_l K(\theta),$$
where $\bar{R}_i$ is the average of the overall ranks assigned to the $i$th population. The score function is given by the $r \times 1$ vector
$$U = \left(n_1\left(\bar{R}_1 - \frac{n+1}{2}\right), \ldots, n_r\left(\bar{R}_r - \frac{n+1}{2}\right)\right)'.$$
The information matrix $I$ was obtained by Kruskal (1952), and it follows that the Rao score test statistic, denoted $KW$, is
$$KW = U'I^{-1}U = \frac{12}{n(n+1)}\sum_{i=1}^{r} n_i\left(\bar{R}_i - \frac{n+1}{2}\right)^2. \qquad (6.8)$$
We reject the null hypothesis for large values of $KW$. The Kruskal-Wallis test is thus locally most powerful for testing (6.6).
When there are ties in the data, we may adjust the ranks by using the mid-ranks for the tied data, as we did for the Wilcoxon test. The permutation method can also be applied to the $KW$ statistic. However, in order to maintain the chi-square approximation, the $KW$ statistic should be modified. Recall that under the one-way ANOVA model, $SSB/\sigma^2 \sim \chi^2(r-1)$ under $H_0$, and the mean of the ranks is still $\frac{n+1}{2}$ whether or not there are ties. It is thus natural to assume that for some constant $C$,
$$KW_{ties} \equiv C\sum_{i=1}^{r} n_i\left(\bar{R}_i - \frac{n+1}{2}\right)^2, \qquad E(KW_{ties}) = r - 1.$$
Now
$$E\left(\bar{R}_i - \frac{n+1}{2}\right)^2 = Var(\bar{R}_i) = \frac{(n-n_i)\sigma^2}{(n-1)n_i},$$
where
$$\sigma^2 = \frac{n^2-1}{12} - \frac{\sum_{i=1}^{g}(t_i^3 - t_i)}{12n}$$
is the population variance of the combined ranks or adjusted ranks and $g$ is the number of tied groups. Hence, we have
$$E(KW_{ties}) = E\left[C\sum_{i=1}^{r} n_i\left(\bar{R}_i - \frac{n+1}{2}\right)^2\right] = C(r-1)\frac{n\sigma^2}{n-1},$$
with
$$C = \frac{n-1}{n\sigma^2}.$$
Thus, a more general form of the $KW$ statistic in the case of ties is given by
$$KW_{ties} = \frac{KW}{1 - \dfrac{\sum_{i=1}^{g}(t_i^3 - t_i)}{n^3 - n}}.$$
Remark. Note that this is only an intuitive derivation of the $KW$ statistic in the case of ties. For a more formal proof, see Kruskal and Wallis (1952). The Kruskal-Wallis test can be implemented in R; a sketch using the base function kruskal.test follows.
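The grouped data below are hypothetical; kruskal.test applies the tie-corrected statistic $KW_{ties}$ automatically:

scores <- c(6.5, 7.0, 8.1, 9.2, 7.7, 6.9,     # group 1
            8.4, 9.6, 9.9, 8.8, 10.1, 9.0,    # group 2
            5.5, 6.2, 7.1, 6.6, 5.9, 6.8)     # group 3
grp <- factor(rep(1:3, each = 6))
kruskal.test(scores, grp)         # Kruskal-Wallis chi-squared with ties correction
# equivalently: kruskal.test(scores ~ grp)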
The parametric paradigm allows for a further investigation into possible differences among the populations. We may conduct a bootstrap study by sampling with replacement $n_i$ observations from the $i$th population and computing the overall average ranking $\bar{R}_i$. We note that the maximum composite likelihood equation for each $\theta_i$ is
$$\bar{R}_i - \frac{n+1}{2} = \frac{\partial K(\theta_i)}{\partial\theta_i} = \frac{1}{n_i}\sum_{l\neq i}\left[P_\theta(X_i > X_l) - \frac{1}{2}\right]. \qquad (6.9)$$
Hence, after each bootstrap iteration, we may obtain an estimate of the left-hand side of (6.9). The histogram of bootstrapped values under the null hypothesis should be centered around 0. An example illustrates this computation.
Example 6.3. Eighteen lobsters of the same size in a species are divided randomly into three groups, and each group is prepared by a different cook using the same recipe. Each prepared lobster is then rated on each of the criteria of aroma, sweetness, and brininess by professional taste testers. The following shows the combined scores for the lobsters prepared by the three cooks; a higher score represents a better taste of the lobster.
Based on the data, apply the Kruskal-Wallis test to test the null hypothesis that the median scores for all three cooks are the same at the 5% level of significance. A bootstrap sketch along the lines of (6.9) follows.
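Since the score table is not reproduced here, the sketch below uses hypothetical scores for the three cooks in order to illustrate the bootstrap estimate of the left-hand side of (6.9):

dat <- list(cook1 = c(6.5, 7.0, 8.1, 9.2, 7.7, 6.9),
            cook2 = c(8.4, 9.6, 9.9, 8.8, 10.1, 9.0),
            cook3 = c(5.5, 6.2, 7.1, 6.6, 5.9, 6.8))
n   <- sum(lengths(dat))
grp <- rep(seq_along(dat), lengths(dat))
boot <- replicate(2000, {
  z <- unlist(lapply(dat, sample, replace = TRUE))  # resample within each group
  tapply(rank(z), grp, mean) - (n + 1) / 2          # bootstrap estimates of (6.9)
})
apply(boot, 1, quantile, c(0.025, 0.975))  # per-group intervals; cover 0 under H0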
where $R_{ijk}$ is the rank of $X_{ijk}$ among the $i$th row. Similarly,
$$\frac{1}{nI+1}\sum_{i'=1}^{I}\sum_{k'=1}^{n}\mathrm{sgn}(x_{ijk} - x_{i'jk'}) = \frac{C_{ijk}}{nI+1} - \frac{1}{2},$$
where $C_{ijk}$ is the rank of $X_{ijk}$ among the $j$th column. Placing this problem within the context of a smooth model, the composite log likelihood function viewed from the perspective of rows only is proportional to
$$l(\theta) \sim \sum_{ij}\theta_{ij}\sum_{k=1}^{n}\sum_{j'=1}^{J}\sum_{k'=1}^{n}\mathrm{sgn}(x_{ijk} - x_{ij'k'}) - n^2 J\,K(\theta) \sim \sum_{ij}\theta_{ij}\sum_{k=1}^{n}\left(R_{ijk} - \frac{nJ+1}{2}\right) - n^2 J\,K(\theta) \sim \sum_{ij}\theta_{ij}\sum_{k=1}^{n}\frac{R_{ijk}}{nJ+1} - nK(\theta).$$
Similarly, the composite log likelihood function viewed from the perspective of columns only is proportional to
$$l(\theta) \sim \sum_{ij}\theta_{ij}\sum_{k=1}^{n}\frac{C_{ijk}}{nI+1} - nK(\theta).$$
Let
$$x_{ijk} = \frac{R_{ijk}}{nJ+1} + \frac{C_{ijk}}{nI+1}.$$
Consequently, the composite log likelihood from the perspective of both rows and columns is proportional to the sum
$$\sum_{ij}\theta_{ij}\sum_{k=1}^{n}\left[\frac{R_{ijk}}{nJ+1} + \frac{C_{ijk}}{nI+1}\right] - K(\theta) \sim \sum_{ij}\theta_{ij}\sum_{k=1}^{n}x_{ijk} - nK(\theta) \sim n\left[\sum_{ij}\theta_{ij}\bar{x}_{ij.} - K(\theta)\right].$$
Using dots to denote an average over that index, set the parameter space so that
$$\theta_{ij} = \theta_{..} + \alpha_i + \beta_j + \gamma_{ij},$$
where
$$\alpha_i = (\theta_{i.} - \theta_{..}), \qquad \beta_j = (\theta_{.j} - \theta_{..}), \qquad \theta_{..} = \frac{1}{IJ}\sum\theta_{ij}$$
and
$$\gamma_{ij} = \theta_{ij} - \theta_{i.} - \theta_{.j} + \theta_{..}.$$
Here, $\alpha_i, \beta_j, \gamma_{ij}$ represent respectively the row, column, and interaction effects. It can be seen that
$$\sum_{ij}\theta_{ij}\bar{x}_{ij.} = IJ\,\theta_{..}\bar{x}_{...} + J\sum_i\alpha_i(\bar{x}_{i..} - \bar{x}_{...}) + I\sum_j\beta_j(\bar{x}_{.j.} - \bar{x}_{...}) + \sum_{ij}\gamma_{ij}(\bar{x}_{ij.} - \bar{x}_{i..} - \bar{x}_{.j.} + \bar{x}_{...})$$
and
$$\frac{\partial l(\theta)}{\partial\gamma_{ij}} = \left[\bar{x}_{ij.} - \bar{x}_{i..} - \bar{x}_{.j.} + \bar{x}_{...}\right] - \frac{\partial K(\theta)}{\partial\gamma_{ij}}.$$
The resulting score test statistic takes the form
$$n^{-1}\,U'\Sigma^{-}U. \qquad (6.10)$$
Chapter Notes
1. Gao and Alvo (2005b) provide a brief historical look at the analysis of unbalanced
two-way layout with interaction effects. Using the notion of a weighted rank, they
present tests for both main effects and interaction effects. In addition, there is
a discussion of the asymptotic relative efficiency of the proposed tests relative to
the parametric F test. Various simulations further exemplify the power of the
proposed tests. In a specific application, it is shown that the test statistic is the
most robust in the presence of extreme outliers compared to other procedures.
2. Gao et al. (2008) also consider nonparametric multiple comparison procedures for
unbalanced one-way factorial designs whereas Gao and Alvo (2008) treat nonpara-
metric multiple comparison procedures for unbalanced two-way layouts.
3. Alvo and Pan (1997) considered the two-way layout with an ordered alternative
using the tools of the unified theory. It was seen that the Spearman’s distance
induced the Page statistic (Page, 1963). The statistic induced by Hamming’s
distance was new (see also Schach (1979)). Alvo and Cabilio (1995) considered
the two-way layout with ordered alternatives when the data within blocks may be
incomplete and they obtained generalizations of the Page and Jonckheere statistics.
Cabilio and Peng (2008) considered a multiple comparison procedure for ordered
alternatives when the data are incomplete which maintains the experimentwise
error rate at a preassigned level.
6.5. Exercises
Exercise 6.1. Consider the randomized block experiment given by the model
$$X_{ij} = b_i + \tau_j + e_{ij}, \quad i = 1, \ldots, n;\ j = 1, \ldots, t,$$
where $b_i$ is a block effect, $\tau_j$ is a treatment effect, and $\{e_{ij}\}$ are independent identically distributed error terms having a continuous distribution. Use the unified theory of hypothesis testing to obtain the statistic that corresponds to the Spearman measure of distance in order to test
$$H_0 : \tau_1 = \tau_2 = \ldots = \tau_t$$
against the ordered alternative
$$H_1 : \tau_1 \le \tau_2 \le \ldots \le \tau_t$$
with at least one strict inequality. (See Alvo and Cabilio (1995).)
Exercise 6.2. Suppose that one observes at times $t_1 < t_2 < \ldots < t_k$ a random sample of $n_i$ binary variables $\{y_{ij}\}$ taking values 1 or 0 with unknown probabilities $\theta_i$, $1-\theta_i$, respectively. Use the unified theory of hypothesis testing to obtain the statistic that corresponds to the Spearman measure of distance in order to test the null hypothesis of homogeneity
$$H_0 : \theta_1 = \theta_2 = \ldots = \theta_k$$
against the ordered alternative
$$H_1 : \theta_1 \le \theta_2 \le \ldots \le \theta_k$$
with at least one strict inequality. (See Alvo and Berthelot (2012).)
Exercise 6.3. Using the following coded data on drug toxicity during 5 hours, test for
an umbrella alternative when the peak toxicity is assumed to be during the 12:00–13:00
time period.
7. Tests for Trend and Association
In this chapter, we consider additional applications of the smooth model paradigm de-
scribed earlier in Chapter 4. We begin by considering tests for trend. We then proceed
with the study of the one-sample test for a randomized block design. We obtain a dif-
ferent proof of the asymptotic distribution of Friedman’s statistic based on Alvo (2016)
who developed a likelihood function approach for the analysis of ranking data. Further,
we derive a test statistic for the two-sample problem as well as for problems involving
various two-way experimental designs. We exploit the parametric paradigm further by
introducing the use of penalized likelihood in order to gain further insight into the data.
Specifically, if judges provide rankings of t objects, penalized likelihood enables us to
focus on those objects which exhibit the greatest differences.
where $g(h(x, y)) = \frac{1}{2}$, $x \neq y$, and $I(A) = 1$ if the event $A$ occurs and $= 0$ otherwise, with
$$e^{K(\theta)} = \frac{e^{\theta}+1}{2}.$$
The kernel $h(x, y)$ may be used to compare any two values along the sequence of the observations and is a measure of the slope. The case $\theta = 0$ corresponds to the situation when there is no trend. We may construct the composite likelihood function
$$L(\theta_i) = \prod_{j\neq i}\pi\left(h(x_i, x_j); \theta_i\right) = \exp\left[\theta_i\sum_{j\neq i}I(x_j < x_i) - (n-1)K(\theta_i)\right]\prod_{j\neq i}g\left(h(x_i, x_j)\right).$$
The choice of kernel function is motivated by the fact that in testing for an increasing trend, we should focus on observations to the right of the present observation. It is seen that
$$\sum_{j\neq i}I(X_j < X_i) = R_i - 1,$$
where $R_i$ is the rank of $X_i$ among $X_1, \ldots, X_n$. Hence, the log of the composite likelihood is proportional to
$$\ell(\theta) = \log\prod_{i=1}^{n}L_i(\theta_i) \sim \sum_{i=1}^{n}\theta_i R_i - \sum_{i=1}^{n}\theta_i - (n-1)\sum_{i=1}^{n}K(\theta_i).$$
which rejects for large absolute values. It should be noted that with this same approach we can also test for quadratic trends by choosing
$$\theta_i = \beta\left(i - \frac{n+1}{2}\right)^2.$$
The statistic in (7.1) is a well-known test of trend. Since it is a linear rank statistic, its asymptotic distribution can be easily obtained from Theorem 3.2. In fact, we have that for large $n$,
$$\frac{\sum_{i=1}^{n}\left(i - \frac{n+1}{2}\right)\left(\frac{R_i}{n+1} - \frac{1}{2}\right)}{\sigma} \xrightarrow{L} N(0, 1),$$
where
$$\sigma^2 = \frac{1}{12}\sum_{i=1}^{n}\left(i - \frac{n+1}{2}\right)^2 \quad \text{and} \quad \sum_{i=1}^{n}\left(i - \frac{n+1}{2}\right)^2 = \frac{n(n^2-1)}{12}.$$
Yu et al. (2002) obtained a generalization of the trend statistic in the presence of ties.
Example 7.1. In Appendix A.7, precipitation data for Saint John, New Brunswick, Canada, were analyzed for the period 1894-1991 using the Spearman statistic (Alvo and Cabilio (1994)). The Z-score for Saint John was calculated to be 2.08, indicating an increasing trend. A sketch of the computation on a generic series follows.
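The precipitation series itself is not reproduced here, so a simulated series stands in; the standardization uses the fact that Spearman's correlation with time, scaled by $\sqrt{n-1}$, is approximately standard normal under no trend:

set.seed(1)
x <- cumsum(rnorm(98)) + 0.05 * (1:98)   # hypothetical annual series with drift
n <- length(x)
rho <- cor(rank(x), 1:n)                 # Spearman correlation with time
Z <- rho * sqrt(n - 1)                   # approximately N(0, 1) under no trend
Z
cor.test(x, 1:n, method = "spearman")    # built-in equivalent of the trend test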
Let
$$p = (p_1, \ldots, p_{t!})'$$
and consider testing
$$H_0 : p = p_0 \quad \text{vs} \quad H_1 : p \neq p_0, \qquad (7.2)$$
where $p_0 = \frac{1}{t!}\mathbf{1}$. Define a $k$-dimensional vector score function $X(\nu)$ on the space $\mathcal{P}$ and let its smooth probability mass function be given as
$$\pi(x_j; \theta) = \frac{1}{t!}\exp\left(\theta' x_j - K(\theta)\right), \quad j = 1, \ldots, t! \qquad (7.3)$$
Since
$$\sum_{j=1}^{t!}\pi(x_j; \theta) = 1,$$
it can be seen that $K(0) = 0$, and hence the hypotheses in (7.2) are equivalent to
$$H_0 : \theta = 0 \quad \text{vs} \quad H_1 : \theta \neq 0. \qquad (7.4)$$
The log likelihood is proportional to
$$l(\theta) \sim n\left[\theta'\hat{\eta} - K(\theta)\right],$$
where
$$\hat{\eta} = \sum_{j=1}^{t!}x_j\hat{p}_{nj}, \qquad \hat{p}_{nj} = \frac{n_j}{n},$$
and $n_j$ represents the number of observed occurrences of the ranking $\nu_j$. The Rao score statistic evaluated at $\theta = 0$ is
$$U(\theta; X) = \frac{\partial}{\partial\theta}\,n\left[\theta'\hat{\eta} - K(\theta)\right]\Big|_{\theta=0} = n\left[\hat{\eta} - \frac{\partial}{\partial\theta}K(0)\right],$$
where $\chi^2_f(\alpha)$ is the upper $100(1-\alpha)\%$ critical value of a chi-square distribution with $f = \mathrm{rank}(I(\theta))$ degrees of freedom. We note that the test just obtained is the locally most powerful test of $H_0$. In the next section, we specialize this test statistic and consider the score functions of Spearman and Kendall.
Theorem 7.1. Under the null hypothesis for the Spearman scores,
$$Cov(X) = \frac{1}{t!}T_S T_S' = \frac{t+1}{12}\left[tI - J_t\right] \qquad (7.5)$$
with generalized inverse
$$\left(\frac{1}{t!}T_S T_S'\right)^{-} = \frac{12}{t(t+1)}\left[I + J_t\right]. \qquad (7.6)$$
Next, we demonstrate that the Rao score statistic is the well-known Friedman test
(Friedman, 1937).
Theorem 7.2. Under the null hypothesis, the Rao score statistic is asymptotically $\chi^2_{t-1}$ and is given by
$$W = \frac{12n}{t(t+1)}\sum_{i=1}^{t}\left(\bar{R}_i - \frac{t+1}{2}\right)^2, \qquad (7.7)$$
where R̄i is the average of the ranks assigned to the ith object.
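In R, this statistic is available directly as friedman.test; a sketch with hypothetical rankings of $t = 4$ objects by $n = 5$ judges:

R <- matrix(c(4, 3, 2, 1,
              4, 2, 3, 1,
              3, 4, 1, 2,
              4, 3, 2, 1,
              2, 4, 3, 1), nrow = 5, byrow = TRUE)  # rows = judges, columns = objects
friedman.test(R)    # Friedman chi-squared statistic, i.e. W of (7.7)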
$$\hat{\eta} = \sum_{j=1}^{t!}x_j\hat{p}_{nj} = T_S\,\hat{p}_n,$$
where $\hat{p}_n = (\hat{p}_{nj})'$.
$$E_\theta[X] = 0,$$
$$T_K = \left(t_K(\nu_1), \ldots, t_K(\nu_{t!})\right).$$
Theorem 7.3. Under the null hypothesis for the Kendall scores,
the covariance matrix of $X$ is given by
$$Cov(X) = \frac{1}{t!}T_K T_K',$$
whose entries $A(s, s', t, t') = \frac{1}{t!}\sum_{\nu}\mathrm{sgn}(\nu(s) - \nu(t))\,\mathrm{sgn}(\nu(s') - \nu(t'))$ are given by
$$A(s, s', t, t') = \begin{cases}0 & s \neq s',\ t \neq t',\\ 1 & s = s',\ t = t',\\ \frac{1}{3} & s = s',\ t \neq t',\\ -\frac{1}{3} & s = t',\ s' = t.\end{cases}$$
Moreover, the eigenvalues of $Cov(X)$ are $\frac{1}{3}$, $\frac{t+1}{3}$ with multiplicities $\binom{t-1}{2}$, $(t-1)$, respectively.
Proof. Part 1 follows from Lemma 4.1 in Alvo and Yu (2014), p. 58. Part 2 follows by direct calculation.
The inverse matrix can be readily computed even for values of $t$ as large as 10.
Theorem 7.4. Under the null hypothesis, the Rao score statistic for the Kendall scores is asymptotically $\chi^2_{\binom{t}{2}}$ as $n \to \infty$ and is given by
$$n\,(T_K\hat{p}_n)'\left(T_K T_K'\right)^{-1}(T_K\hat{p}_n). \qquad (7.9)$$
An alternate form satisfies
$$n\,(T_K\hat{p}_n)'(T_K\hat{p}_n) \xrightarrow{L} \frac{1}{3}\left[(t+1)\chi^2_{t-1} + \chi^2_{\binom{t-1}{2}}\right].$$
The summation is over all $\binom{t}{2}$ pairs of objects and $\#_i$ is the number of judges whose ranking of the pair $i$ of objects agrees with the ordering of the same pair in a criterion ranking such as the natural ranking. The distribution of the Kendall statistic (7.9) is simpler, though its form is somewhat more complicated. In the alternate form, the reverse is true.
In this section, we have seen that we can derive some well-known statistics through
the parametric paradigm and that these are locally most powerful. We proceed next to
show that a similar result can be obtained using Hamming score functions.
$$T_H = \left(t_H(\nu_1), \ldots, t_H(\nu_{t!})\right).$$
Theorem 7.5. Under the null hypothesis for the Hamming scores,
(a) the covariance function of $X$ is given by
$$\Gamma = Cov(X) = \frac{1}{t-1}\left(I - \frac{J}{t}\right)\otimes\left(I - \frac{J}{t}\right)$$
and
$$(t-1)\,n\,(T_H\hat{p}_n)'(T_H\hat{p}_n) \xrightarrow{L} \chi^2_{(t-1)^2}.$$
This test was first introduced by Anderson (1959) and rediscovered by Kannemann
(1976). Schach (1979) obtained the asymptotic distribution of the statistic based on
Hamming distance under the null hypothesis and under contiguous alternatives making
use of Le Cam’s third lemma. Alvo and Cabilio (1998) extended the statistic to include
various block designs.
where $\theta_l = (\theta_{l1}, \ldots, \theta_{lt})'$ represents the vector of parameters for population $l$. We are interested in testing
$$H_0 : \theta_1 = \theta_2 \quad \text{vs} \quad H_1 : \theta_1 \neq \theta_2.$$
The probability distribution $\{p_l(j)\}$ represents an unspecified null situation. Define
$$\hat{p}_l = \left(\frac{n_{l1}}{n_l}, \ldots, \frac{n_{lt!}}{n_l}\right)',$$
$$\theta_l = m + b_l\gamma,$$
where
$$m = \frac{n_1\theta_1 + n_2\theta_2}{n_1+n_2}, \qquad b_1 = \frac{n_2}{n_1+n_2}, \qquad b_2 = -\frac{n_1}{n_1+n_2}.$$
$$\Sigma_l = \Pi_l - p_l p_l',$$
where $\Pi_l = \mathrm{diag}(p_l(1), \ldots, p_l(t!))$ and $p_l = (p_l(1), \ldots, p_l(t!))'$. The logarithm of the likelihood $L$ as a function of $(m, \gamma)$ is proportional to
$$\log L(m, \gamma) \sim \sum_{l=1}^{2}\sum_{j=1}^{t!}n_{lj}\left[(m + b_l\gamma)'x_j - K(\theta_l)\right].$$
Theorem 7.6. Consider the two-sample ranking problem whereby we wish to test
$$H_0 : \theta_1 = \theta_2 \quad \text{vs} \quad H_1 : \theta_1 \neq \theta_2.$$
Proof. The Rao score vector evaluated under the null hypothesis is given by
$$\frac{\partial\log L(m, \gamma)}{\partial\gamma} = \frac{n_1 n_2}{n_1+n_2}\left(T_S\hat{p}_1 - T_S\hat{p}_2\right).$$
The result of this section was first obtained in Feigin and Alvo (1986) using notions
of diversity. It is derived presently through the parametric paradigm. See Chapter 4.2
of Feigin and Alvo (1986) for a discussion on the efficient calculation of the test statistic.
The parametric paradigm can also be used to deal with the two-sample mixture problem using a distribution expressed as
$$\pi(X_1, X_2; \theta_1, \theta_2) = \lambda\pi(X_1; \theta_1) + (1-\lambda)\pi(X_2; \theta_2), \quad 0 < \lambda < 1.$$
In that case, the use of the EM algorithm can provide estimates of the parameters (see Casella and George (1992)).
Let
$$\Lambda(\theta, c) = -\theta'\sum_{j=1}^{t!}n_j x_j + nK(\theta) + \lambda\left(\sum_{i=1}^{t}\theta_i^2 - c\right) \qquad (7.11)$$
represent the penalizing function for some prescribed values of the constant $c$. We shall assume for simplicity that $\|x_j\| = 1$. When $t$ is large (say $t \ge 10$), the computation of the exact value of the normalizing constant $K(\theta)$ involves a summation over $t!$ objects.
McCullagh (1993) noted the resemblance of (7.3) to the continuous von Mises-Fisher density
$$f(x; \theta) = \frac{\|\theta\|^{\frac{t-3}{2}}}{2^{\frac{t-3}{2}}\,t!\,I_{\frac{t-3}{2}}(\|\theta\|)\,\Gamma\left(\frac{t-1}{2}\right)}\exp(\theta'x),$$
where $\|\theta\|$ is the norm of $\theta$, $x$ is on the unit sphere, and $I_\upsilon(z)$ is the modified Bessel function of the first kind given by
$$I_\upsilon(z) = \sum_{k=0}^{\infty}\frac{1}{\Gamma(k+1)\Gamma(\upsilon+k+1)}\left(\frac{z}{2}\right)^{2k+\upsilon}.$$
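For small $t$, however, $K(\theta)$ can be evaluated exactly by enumerating the $t!$ rankings; the sketch below does so for $t = 3$ with centred Spearman scores (our construction), and shows the base R Bessel function needed for the approximation:

perms <- rbind(c(1,2,3), c(1,3,2), c(2,1,3), c(2,3,1), c(3,1,2), c(3,2,1))
X <- perms - 2            # centred Spearman scores: nu - (t + 1)/2 for t = 3
K.exact <- function(theta) log(mean(exp(X %*% theta)))  # since sum_j pi(x_j; theta) = 1
K.exact(c(0.5, -0.2, -0.3))
besselI(0.6, nu = 0)      # modified Bessel function I_nu(z) appearing in f(x; theta)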
likelihood estimate of $\theta$ from each bootstrap sample. Repeating this procedure 10,000 times leads to the bootstrap distribution for $\theta$. In this way, we can draw useful inference from the distribution of $\theta$ and, in particular, construct two-sided confidence intervals for its components. We applied this to a data set with $t = 3$. Define the probabilities of the rankings so that, for example,
$$p_1 + p_2 = \Pr(\text{giving rank 1 to item 1}),$$
which compares the probabilities of assigning the lowest and highest rank to object 1. The other components make similar comparisons for the other objects.
Example 7.2. Sutton Data (One-Sample Case)
In her 1976 thesis, C. Sutton considered the leisure preferences and attitudes on retirement of the elderly for 14 white and 13 black females in the age group 70-79 years. Each individual was asked: with which sex do you wish to spend your leisure time? Each female was asked to rank the three responses, male(s), female(s), or both, assigning rank 1 to the most desired and 3 to the least desired. The first object in the ranking corresponds to "male," the second to "female," and the third to "both." To illustrate the approach in the one-sample case, we combined the data from the two groups as in Table 7.1.
We applied the method of penalized likelihood in this situation and the results are shown in Table 7.2. To better illustrate the result, we rearrange it (unconstrained $\theta$, $c = 1$) together with the data as in Table 7.3. It can be seen that $\theta_1$ is the largest coefficient and object 1 (Male) shows the greatest difference between the number of judges choosing rank 1 or rank 3, which implies that the judges dislike spending leisure time with males the most. For object 3 (Both), the large negative value of $\theta_3$ implies that the judges prefer to spend leisure time with both sexes the most. $\theta_2$ is close to zero, and we deduce that the judges show no strong preference regarding Female. This is consistent with the hypothesis that $\theta$ close to zero means randomness. To conclude, the results also show that $\theta_i$ weights the difference in probability between assigning the lowest and the top rank to object $i$. A negative value of $\theta_i$ means the judges prefer object $i$ more, whereas a positive $\theta_i$ means the judges are more likely to assign a lower rank to object $i$.
We plot the bootstrap distribution of $\theta$ in Figure 7.1. For $H_0 : \theta_i = 0$, we see that $\theta_1$ and $\theta_3$ are significantly different from 0 whereas $\theta_2$ is not. We also see that the bootstrap distributions are not entirely bell shaped, leading us to conclude that a traditional t-test method may not be appropriate in this case.
For the Kendall score, we consider once again the Sutton data ($t = 3$) and apply a penalized likelihood approach. The results are exhibited in Table 7.4. We rearrange the Sutton data focusing on paired comparisons and the results ($c = 1$) are displayed in Table 7.5. First, we note that all the $\theta_i$'s are negative. This is consistent
Table 7.4.: Penalized likelihood using the Kendall score function for the Sutton data
Paired comparison                        Choice of c
object i  object j            c = 0.5  c = 1   c = 2   c = 10   no constraint
1         2         θ1        -0.35    -0.49   -0.70   -1.56    -0.60
1         3         θ2        -0.56    -0.80   -1.13   -2.53    -0.97
2         3         θ3        -0.24    -0.34   -0.48   -1.08    -0.41
                    Λ(θ, c)   42.79    40.17   40.20   127.76   39.59
[Figure 7.1.: Bootstrap distributions for θ1 and θ3.]
Table 7.5.: Paired comparison for the Sutton data and the estimation of θ
object i  object j  Number of judges  Paired comparison   θ
1         2          7               more prefer 1       -0.49
                    20               more prefer 2
1         3          3               more prefer 1       -0.80
                    24               more prefer 3
2         3          9               more prefer 2       -0.34
                    18               more prefer 3
with our interpretations. The judges show a strong preference for Both over Males and for Females over Males; the preference for Both over Females is the weakest. We may conclude that the $\theta_i$'s represent well the paired preferences among the judges. We applied penalized likelihood in this situation and the results are shown in Table 7.6.
$$\Lambda(m, \gamma) = -\sum_{l=1}^{2}\sum_{j=1}^{t!}(m + b_l\gamma)'n_{lj}x_{lj} + \sum_{l=1}^{2}n_l K(m + b_l\gamma) + \lambda\left(\sum_{i=1}^{t}\gamma_i^2 - c\right)$$
for some prescribed values of the constants $c$ and $\lambda$. We continue to use the approximation to the normalizing constant from the von Mises-Fisher distribution to approximate $K(\theta)$.
Here $\gamma_i$ shows the difference between the two populations' preferences for object $i$. A negative $\gamma_i$ means that population 1 shows more preference for object $i$ than population 2. A positive $\gamma_i$ means that population 2 shows more preference for object $i$ than population 1. For $\gamma_i$ close to zero, there is no difference between the two populations on that object. As we shall see, this interpretation is consistent with the results in the real data applications. From the definition of $m$, we know that $m$ is the common part of $\theta_1$ and $\theta_2$. More specifically, $m$ is the weighted average of $\theta_1$ and $\theta_2$, taking into account the sample sizes of the populations.
As an application, consider the Sutton data ($t = 3$) found in Table 7.7. Rearranging the results for $c = 1$, we have the original data in Table 7.8. First, it is seen that $m$ behaves just like the $\theta$'s in the one-sample problem. For example, $m_3$ is the smallest value, and the whole population prefers object "Both" best; $m_1$ is the largest, and the whole population mostly dislikes object "Male." This is not surprising since we know that $m$ is the common part of $\theta_1$ and $\theta_2$. For the parameter $\gamma$, we note that white females prefer to spend leisure time with Females (8 individuals assign rank 1) whereas black females do not (6 individuals assign rank 3). We see that $\gamma_2$ is negative and largest in absolute value. There is a significant difference of opinion with respect to object 2, Female. For objects "Male" and "Both," black females prefer them more than white females do. To conclude, the results are consistent with the interpretation of $m$ and $\gamma$.
We conclude that the use of the parametric paradigm provided more insight on the
objects being ranked. Further details for the penalized likelihood approach are found in
Alvo and Xu (2017).
7.5.1. Compatibility
Compatibility is a concept that was introduced in connection with incomplete rankings. Specifically, suppose that $\mu = (\mu(1), \ldots, \mu(t))$ represents a complete ranking of $t$ objects and that $\mu^* = (\mu^*(o_1), \ldots, \mu^*(o_k))$ represents an incomplete ranking of a subset of $k$ of these objects, where $o_1, \ldots, o_k$ represent the labels of the objects ranked. Alternatively, for an incomplete ranking, we may retain the $t$-dimensional vector notation and indicate by a "$-$" an unranked object. Hence, the incomplete ranking $\mu^* = (2, -, 3, 4, 1)$ indicates that among the $t = 5$ objects, only object "2" is not ranked. In this notation, complete and incomplete rankings are of the same length $t$.
$$T_h^* = \frac{k_h!}{t!}\,T C_h \qquad (7.12)$$
whose columns are the score vectors of the corresponding incomplete rankings.
Example 7.4. Let t = 3, kh = 2. The complete rankings associated with the rows are
in the order (123), (132), (213), (231), (312), (321). For the incomplete rankings (12_),
(21_) indexing the columns, the associated compatibility matrix C h is
$$C_h = \begin{bmatrix}1 & 0\\ 1 & 0\\ 0 & 1\\ 1 & 0\\ 0 & 1\\ 0 & 1\end{bmatrix}.$$
$$T_h^* = \frac{1}{3}\begin{bmatrix}-1 & -1 & 0 & 0 & 1 & 1\\ 0 & 1 & -1 & 1 & -1 & 0\\ 1 & 0 & 1 & -1 & 0 & -1\end{bmatrix}\begin{bmatrix}1 & 0\\ 1 & 0\\ 0 & 1\\ 1 & 0\\ 0 & 1\\ 0 & 1\end{bmatrix} = \begin{bmatrix}-\frac{2}{3} & \frac{2}{3}\\[0.5ex] \frac{2}{3} & -\frac{2}{3}\\[0.5ex] 0 & 0\end{bmatrix}.$$
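The computation in Example 7.4 can be verified numerically (a sketch of our own; the score matrix is built from the centred rankings):

perms <- rbind(c(1,2,3), c(1,3,2), c(2,1,3), c(2,3,1), c(3,1,2), c(3,2,1))
Tmat <- t(perms - 2)            # 3 x 6 Spearman scores, one column per complete ranking
Ch <- cbind(c(1,1,0,1,0,0),     # complete rankings compatible with (12_)
            c(0,0,1,0,1,1))     # complete rankings compatible with (21_)
(factorial(2) / factorial(3)) * Tmat %*% Ch   # reproduces the matrix T_h^* above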
For completeness, we may also extend the notion of compatibility to tied rankings, defined as follows.
Definition 7.2. A tied ordering of $t$ objects is a partition into $e$ sets, $1 \le e \le t$, each containing $d_i$ objects, $d_1 + d_2 + \ldots + d_e = t$, so that the $d_i$ objects in each set share the same rank $i$, $1 \le i \le e$. Such a pattern is denoted $\delta = (d_1, d_2, \ldots, d_e)$. The ranking denoted by $\mu_\delta = (\mu_\delta(1), \ldots, \mu_\delta(t))$ resulting from such an ordering is called a tied ranking and is one of $\frac{t!}{d_1!\,d_2!\,\cdots\,d_e!}$ possible permutations.
For example, if $t = 3$ objects are ranked, it may happen that objects 1 and 2 are equally preferred to object 3. Consequently, the rankings $(1, 2, 3)$ and $(2, 1, 3)$ would both be plausible and should be placed in a "compatibility" class. The average of the rankings in the compatibility class which results from the use of the Spearman distance would then yield the ranking
$$\frac{1}{2}\left[(1, 2, 3) + (2, 1, 3)\right] = (1.5, 1.5, 3).$$
It is seen that this notion of compatibility for ties justifies the use of the mid-rank when ties are present. Associated with every tied ranking, we may define a $t! \times \frac{t!}{d_1!\,d_2!\,\cdots\,d_e!}$ matrix of compatibility. Yu et al. (2002) considered the problem of testing for independence between two random variables when the patterns for ties and for missing observations are fixed.
In the study of the asymptotic behavior of various statistics for such problems, we consider $n$ replications of such basic designs. For any incomplete $k_h$-ranking $\mu^*$, define the score vector of
$$\mu^* = (\mu^*(1), \mu^*(2), \ldots, \mu^*(k_h))$$
as $\mu^*$ ranges over each of the $k_h!$ permutations (Alvo and Cabilio, 1991). From Alvo and Cabilio (1991), for a given permutation of $(1, 2, \ldots, k_h)$, indexed by $s = 1, 2, \ldots, k_h!$, define the vector $x_{h(s)}$ whose $i$th entry is given by
$$\frac{(t+1)}{(k_h+1)}\left[\mu^*_{h(s)}(i) - \frac{k_h+1}{2}\right]\delta_h(i) = \left[\frac{(t+1)}{(k_h+1)}\mu^*_{h(s)}(i) - \frac{t+1}{2}\right]\delta_h(i), \qquad (7.13)$$
where $\delta_h(i)$ is either 1 or 0 depending on whether the object $i$ is, or is not, ranked in block $h$, and $\mu^*_{h(s)}(i)$, as defined above, is the rank of object $i$ for the permutation indexed by $s$ for block pattern $h$. This is also the corresponding $i$th row element of column $s$ of
$$T_h^* = \frac{k_h!}{t!}\,TC_h.$$
An $(i, j)$ element of $\left(\frac{k_h!}{t!}T_S C_h\right)\left(\frac{k_h!}{t!}T_S C_h\right)'$ is thus of the form
$$\sum_{s=1}^{k_h!}\left[\frac{(t+1)}{(k_h+1)}\mu^*_{h(s)}(i) - \frac{t+1}{2}\right]\left[\frac{(t+1)}{(k_h+1)}\mu^*_{h(s)}(j) - \frac{t+1}{2}\right]\delta_h(i)\delta_h(j).$$
For a specific pattern of missing observations for each of the $b$ blocks, the matrix of scores is given by
$$T^* = (T_1^* \mid T_2^* \mid \ldots \mid T_b^*) = \left(\frac{k_1!}{t!}TC_1 \;\Big|\; \frac{k_2!}{t!}TC_2 \;\Big|\; \ldots \;\Big|\; \frac{k_b!}{t!}TC_b\right),$$
where the index of each block identifies the pattern of missing observations, if any, in that block. It was shown in Alvo and Cabilio (1991) that the proposed test rejects $H_0$ for large values of
$$G \equiv (T^*f)'(T^*f) = f'(T^*)'T^*f, \qquad (7.14)$$
where the $\sum_{h=1}^{b}k_h!$-dimensional vector $f$ is the vector of frequencies for each of the $b$ patterns of the observed incomplete rankings. That is, $f = (f_1 \mid f_2 \mid \cdots \mid f_b)'$, where $f_h$ is the $k_h!$-dimensional vector of the observed frequencies of each of the $k_h!$ ranking permutations for the incomplete pattern $h = 1, 2, \ldots, b$. Using the fact that for these distance measures the matrix $T^*$ is orthogonal to the vector of 1's, and proceeding in a manner analogous to Alvo and Cabilio (1991), gives the following. Moreover, we have
$$\frac{1}{\sqrt{n}}\,T^*f \xrightarrow{L} N(0, \Gamma), \qquad (7.15)$$
$$\Gamma = \sum_{h=1}^{b}\frac{1}{k_h!}\left(\frac{k_h!}{t!}TC_h\right)\left(\frac{k_h!}{t!}TC_h\right)'. \qquad (7.16)$$
Thus
$$n^{-1}G = n^{-1}(T^*f)'(T^*f) \xrightarrow{L} \sum_i\alpha_i z_i^2, \qquad (7.17)$$
where $\{z_i\}$ are independent identically distributed normal variates and $\{\alpha_i\}$ are the eigenvalues of $\Gamma$.
In the Spearman case, for a given permutation of $(1, 2, \ldots, k_h)$, indexed by $s = 1, 2, \ldots, k_h!$, the corresponding $i$th row element of column $s$ of $\frac{k_h!}{t!}T_S C_h$ is found to be
$$\frac{(t+1)}{(k_h+1)}\left[\mu^*_{h(s)}(i) - \frac{k_h+1}{2}\right]\delta_h(i) = \left[\frac{(t+1)}{(k_h+1)}\mu^*_{h(s)}(i) - \frac{t+1}{2}\right]\delta_h(i), \qquad (7.18)$$
where $\delta_h(i)$ is either 1 or 0 depending on whether the object $i$ is, or is not, ranked in block $h$, and $\mu^*_{h(s)}(i)$ is the rank of object $i$ for the permutation indexed by $s$ for block pattern $h$.
Lemma 7.1. In the Spearman case, we have that the $t \times t$ matrix
$$\Gamma = \sum_{h=1}^{b}\frac{1}{k_h!}\left(\frac{k_h!}{t!}T_S C_h\right)\left(\frac{k_h!}{t!}T_S C_h\right)' = \sum_{h=1}^{b}\gamma_h^2 A_h,$$
where
$$A_h = \begin{cases}\frac{k_h-1}{k_h}\delta_h(j) & \text{on the diagonal,}\\[0.5ex] -\frac{1}{k_h}\delta_h(j)\delta_h(j') & \text{off the diagonal,}\end{cases}$$
and $\gamma_h^2 = \frac{1}{k_h-1}\sum_{j=1}^{k_h}\left[\frac{(t+1)}{(k_h+1)}j - \frac{t+1}{2}\right]^2$. The elements of $\Gamma$ are thus
$$\begin{cases}(t+1)^2\,\frac{1}{12}\sum_{h=1}^{b}\frac{k_h-1}{k_h+1}\delta_h(j) & \text{on the diagonal,}\\[0.5ex] -(t+1)^2\,\frac{1}{12}\sum_{h=1}^{b}\frac{1}{k_h+1}\delta_h(j)\delta_h(j') & \text{off the diagonal.}\end{cases} \qquad (7.19)$$
Note that the elements of each row of $\Gamma$ sum to 0, so that $\mathrm{rank}(\Gamma) \le t - 1$.
The matrix $\Gamma$ with elements given in Lemma 7.1 is closely related to the information matrix of a block design. John and Williams (1995) detail how this matrix occurs in the least squares estimation of treatment effects and the role its eigenvalues play in determining optimality criteria for choosing between different designs. This information matrix $A$ has components as follows:
$$\begin{cases}\sum_{h=1}^{b}\frac{k_h-1}{k_h}\delta_h(i) & \text{on the diagonal,}\\[0.5ex] -\sum_{h=1}^{b}\frac{1}{k_h}\delta_h(i)\delta_h(j) & \text{off the diagonal.}\end{cases} \qquad (7.20)$$
The smooth alternative for the entire block design will be the product of the models for each block. We are interested in testing
$$H_0 : \theta_h = 0 \text{ for all } h \quad \text{vs} \quad H_1 : \theta_h \neq 0 \text{ for some } h.$$
The log likelihood, derived from the joint multinomial probability function, is given by
$$l(\theta) \sim \sum_{h=1}^{b}\sum_{s=1}^{k_h!}n_{h(s)}\log\pi\left(\theta_h; x_{h(s)}\right) = \sum_{h=1}^{b}\theta_h'\sum_{s=1}^{k_h!}n_{h(s)}x_{h(s)} - \sum_{h=1}^{b}\sum_{s=1}^{k_h!}n_{h(s)}K(\theta_h),$$
where $n_{h(s)}$ represents the frequency of occurrence of the value $x_{h(s)}$. Under this formulation, using a similar argument as in Section 7.2, we can show that the score test leads to the result in Theorem 7.7. Two examples of block designs are considered below.
block. An example of a test in such a situation is the Friedman test with test statistic
$$G_S = \sum_{i=1}^{t}\left(R_i - \frac{n(t+1)}{2}\right)^2,$$
where $R_i$ is the sum of the ranks assigned by the judges to object $i$. Under $H_0$, as $n \to \infty$,
$$\frac{1}{n}G_S \xrightarrow{L} \frac{t(t+1)}{12}\chi^2_{t-1}.$$
An interpretation of the Friedman statistic is that it is essentially the average of the Spearman correlations between all pairs of rankings. In the balanced incomplete block design, we have $k_h = k$, $r_i = r$, $\lambda_{ij} = \lambda$, $bk = rt$, and $\lambda(t-1) = r(k-1)$. An example of a test in such a case is due to Durbin, whose test statistic is
$$G_S = \sum_{i=1}^{t}\left(\frac{(t+1)}{(k+1)}R_i - \frac{nr(t+1)}{2}\right)^2.$$
Under $H_0$, as $n \to \infty$,
$$\frac{1}{n}G_S \xrightarrow{L} \frac{\lambda t(t+1)^2}{12(k+1)}\chi^2_{t-1}.$$
$$bk = rt, \qquad r(k-1) = (d-1)\lambda_1 + d(g-1)\lambda_2.$$
7.6. Exercises
Exercise 7.1.
1. Calculate $E[W]$ under the null hypothesis.
Hint: First show that
$$W = \frac{12}{nt(t+1)}\sum_{i=1}^{t}\left(\sum_{j=1}^{n}R_{ij}\right)^2 - 3n(t+1)$$
and then use the properties of the ranks from Lemma 3.1 under the null hypothesis.
Exercise 7.3. Some evidence suggests that anxiety and fear can be differentiated from
anger according to feelings along a dominance-submissiveness continuum. In order to
determine the reliability of the ratings on a sample of animals, the ranks of the ratings
given by two observers, Ryan and Jacob, were tabulated below. Perform a suitable test
at the 5% significance level whether Ryan and Jacob agree on their ratings.
Animal 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Ryan’s Ranks 4 7 13 3 15 12 5 14 8 6 16 2 10 1 9 11
Jacob’s Ranks 2 6 14 7 13 15 3 11 8 5 16 1 12 4 9 10
Exercise 7.4. It is widely believed that the Earth is getting warmer year by year, caused by the increasing amount of greenhouse gases emitted by humans, which trap more heat in the Earth's atmosphere. However, some people cast doubt on that, claiming that the average global temperature actually remains stable over time whereas extreme weather occurs more and more frequently. In order to test these claims, the data on the annual mean temperature ($MEAN$, in °C) and the annual maximum ($MAX$) and minimum ($MIN$) temperatures (in °C) in Hong Kong for the years ($YEAR$) 1961 to 2016 are extracted from the Hong Kong Observatory website; see Appendix A.8.
(a) Test whether the annual mean temperature M EAN shows an increasing trend at
the 5% significance level using (i) Spearman test and (ii) Kendall test.
(c) Based on the results found in (a) and (b), draw your conclusions and discuss the limitations of your analysis.
Exercise 7.5. Eight subjects were given one, two, and three alcoholic beverages at widely spaced times. After each drink, the closest distance in feet they would approach an alligator was measured. Test the hypothesis that there is no difference between the different amounts of drink consumed.
1 2 3 4 5 6 7 8
One drink 19.0 14.4 18.2 15.6 14.6 11.2 13.9 11.6
Two drinks 6.3 11.6 9.7 5.9 13.0 9.8 4.8 10.7
Three drinks 3.3 1.2 3.7 7.1 2.6 1.9 5.2 6.4
Exercise 7.6. There are 6 psychiatrists who examine 10 patients for depression. Each
patient is seen by 3 psychiatrists who provide a score as shown below. Analyze the data
using a nonparametric balanced incomplete block design. Are there differences among
the psychiatrists?
Patient 1 2 3 4 5 6
1 10 14 10
2 3 2 1
3 7 12 9
4 3 8 5
5 20 26 20
6 20 14 20
7 5 8 14
8 14 18 15
9 12 17 12
10 18 19 13
8. Optimal Rank Tests
Lehmann and Stein (1949) and Hoeffding (1951b) pioneered the development of an op-
timal theory for nonparametric tests, parallel to that of Neyman and Pearson (1933)
and Wald (1949) for parametric testing. They considered nonparametric hypotheses
that are invariant under permutations of the variables in multi-sample problems1 so
that rank statistics are the maximal invariants, and extended the Neyman-Pearson and
Wald theories for independent observations to the joint density function of the maximal
invariants. Terry (1952) and others subsequently implemented and refined Hoeffding’s
approach to show that a number of rank tests are locally most powerful at certain alter-
natives near the null hypothesis. We shall first consider Hoeffding’s change of measure
formula and derive some consequences with respect to the two-sample problem. This
formula assumes knowledge of the underlying distribution of the random variables and
leads to an optimal choice of score functions and subsequently to locally most powerful
tests. Hence, for any given underlying distributions, we may obtain the optimal test
statistic.
In previous chapters, we did not assume knowledge of the underlying distributions of
the random variables. Instead, we derived test statistics through a parametric embedding
based on either a kernel or score function. The connection between the two approaches is
now evident. Using the results of the present chapter, it will then be possible to calculate
the efficiency of previously derived test statistics with respect to the optimal for specific
underlying distribution of the variables. We postpone the discussion of efficiency to
Chapter 9 of this book.
In Section 8.1 we see that the locally most powerful tests can be included in the
unified theory discussed in Section 6.1. In Section 8.2 we discuss the regression problem,
whereas in Section 8.3 we present the optimal choice of score function in the two-way
layout.
¹ Lehmann and Stein considered the case of two samples and Hoeffding that of multiple samples.
Lemma 8.1 (Hoeffding's Formula). Assume that $V_1 < \ldots < V_n$ are the order statistics of a sample of size $n$ from a density $h$. Then
$$P(R = r) = \int\cdots\int_{(v_1 < \ldots < v_n)} f_1(v_1)\cdots f_n(v_n)\,dv_1\cdots dv_n \qquad (8.1)$$
$$= \int\cdots\int_{(v_1 < \ldots < v_n)}\frac{f_1(v_1)\cdots f_n(v_n)}{n!\,h(v_1)\cdots h(v_n)}\;n!\,h(v_1)\cdots h(v_n)\,dv_1\cdots dv_n \qquad (8.2)$$
$$= \frac{1}{n!}\,E_h\left[\frac{f_1\left(V_{(r_1)}\right)\cdots f_n\left(V_{(r_n)}\right)}{h\left(V_{(r_1)}\right)\cdots h\left(V_{(r_n)}\right)}\right]. \qquad (8.3)$$
To illustrate the use of Hoeffding's lemma, we consider the two-sample problem where $X_1, \ldots, X_m$ constitute a random sample from a cdf $F$ having density $f$ and $Y_1, \ldots, Y_{n-m}$ are a random sample from a cdf $G$ having density $g$. As well, assume that $g(x) = 0$ implies $f(x) = 0$. Substituting in Hoeffding's formula (8.3)
$$f_i = \begin{cases}f & i = 1, \ldots, m,\\ g & i = m+1, \ldots, n,\end{cases}$$
and $h = g$, we have
$$P(R = r) = \frac{1}{n!}\,E_g\left[\frac{f\left(V_{(r_1)}\right)\cdots f\left(V_{(r_m)}\right)}{g\left(V_{(r_1)}\right)\cdots g\left(V_{(r_m)}\right)}\right]$$
$$= \frac{1}{n!}\sum E_g\left[\frac{f\left(V_{(r_1)}\right)\cdots f\left(V_{(r_m)}\right)}{g\left(V_{(r_1)}\right)\cdots g\left(V_{(r_m)}\right)}\right] = \frac{1}{\binom{n}{m}}\,E_g\left[\frac{f\left(V_{(r_1)}\right)\cdots f\left(V_{(r_m)}\right)}{g\left(V_{(r_1)}\right)\cdots g\left(V_{(r_m)}\right)}\right], \qquad (8.4)$$
where $V_{(1)} < \ldots < V_{(N)}$ are the order statistics from a random sample of $n$ uniform variables on $(0, 1)$ and the summation is over all possible permutations of the first and second samples among themselves. The expectation $E_g$ is with respect to the probability measure under which the $n$ observations are i.i.d. with common density function $g$, assuming that $g$ is positive whenever $f$ is.
Corresponding to a density function f with cdf F, Hájek and Sidak (1967) considered
a general score function for location problems defined as
\[ a_n(i, f) = E\,\varphi\!\left(V_n^{(i)}, f\right), \quad 1 \le i \le n \qquad (8.5) \]
\[ = n \binom{n-1}{i-1} \int_0^1 \varphi(v, f)\, v^{i-1} (1-v)^{n-i}\, dv, \qquad (8.6) \]
where V_n^{(1)} < ⋯ < V_n^{(n)} are the order statistics from a uniform distribution on (0, 1)
and
\[ \varphi(v, f) = -\frac{f'(F^{-1}(v))}{f(F^{-1}(v))}, \quad 0 < v < 1. \]
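As a quick illustration, the scores (8.5) may be approximated by Monte Carlo. The
following sketch (ours, not the book's) uses the logistic distribution, for which
φ(v, f) = 2v − 1; the replication size is an arbitrary choice.

```python
import numpy as np

def logistic_phi(v):
    # phi(v, f) = -f'(F^{-1}(v)) / f(F^{-1}(v)) = 2v - 1 for the logistic density
    return 2.0 * v - 1.0

def mc_scores(n, phi, n_rep=20000, seed=1):
    rng = np.random.default_rng(seed)
    # rows of sorted uniforms are the uniform order statistics V_n^{(1)} < ... < V_n^{(n)}
    u = np.sort(rng.uniform(size=(n_rep, n)), axis=1)
    return phi(u).mean(axis=0)          # a_n(i, f), i = 1..n

print(mc_scores(5, logistic_phi))       # close to 2*i/(n+1) - 1 for i = 1..5
```

Since E[V_n^{(i)}] = i/(n + 1), the printed values should be near 2i/(n + 1) − 1.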
Theorem 8.1 (Location Alternatives). Assume ∫|f′(x)| dx < ∞. Suppose that we have
a sample of size m from the first population and a sample of size n from the second. The
test with critical region
\[ \sum_{i=1}^m a_n(R_i, f) \ge k \]
is locally most powerful among rank tests for testing
\[ H_0: f(x_1, \dots, x_{m+n}) = \prod_{i=1}^{m+n} f_X(x_i) \qquad (8.7) \]
against
\[ H_1: f(x_1, \dots, x_{m+n}) = \prod_{i=1}^{m} f_X(x_i - \Delta) \prod_{i=m+1}^{m+n} f_X(x_i), \quad \Delta > 0. \]
We shall consider two examples to illustrate this theorem, which results from an
application of Hoeffding's formula: the Wilcoxon test, which arises when g(x) = e^x/(1 +
e^x)² is the logistic density, and the Fisher-Yates test, which arises when g is the standard
normal density.
Example 8.1 (Wilcoxon Test). Suppose that we have a random sample of size m from
the logistic distribution whose density is given by
\[ f(x) = \frac{e^{-x}}{(1 + e^{-x})^2}, \quad -\infty < x < \infty. \]
Then F(x) = 1/(1 + e^{-x}) and
\[ \log f(x) = -x - 2\log(1 + e^{-x}), \]
from which
\[ \frac{\partial}{\partial x} \log f(x) = \frac{f'(x)}{f(x)} = -1 + 2[1 - F(x)]. \]
Hence,
\[ \frac{f'(F^{-1}(V))}{f(F^{-1}(V))} = -1 + 2[1 - V] = 1 - 2V. \]
Recall from Lemma 2.1 that the ith order statistic from a uniform sample of size m has
density
\[ f_{V_{(i)}}(v) = \frac{m!}{(i-1)!\,(m-i)!}\, v^{i-1}(1-v)^{m-i}, \quad 0 < v < 1. \]
Consequently,
\[ E\big[V_{(i)}\big] = \frac{m!}{(i-1)!\,(m-i)!} \int_0^1 v^{i}(1-v)^{m-i}\,dv
= \frac{m!}{(i-1)!\,(m-i)!} \cdot \frac{i!\,(m+1-i-1)!\cdots}{(m+1)!} = \frac{i}{m+1}. \]
It follows that the locally most powerful test when the underlying density is logistic is
based on
\[ -\sum_{i=1}^m E\left[\frac{f'(F^{-1}(V_{(r_i)}))}{f(F^{-1}(V_{(r_i)}))}\right]
= \sum_{i=1}^m E\big[2V_{(r_i)} - 1\big] = 2\sum_{i=1}^m \frac{R_i}{m+1} - m, \]
which is equivalent to the Wilcoxon statistic based on the sum of the ranks.
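A minimal sketch (ours) of this statistic in the two-sample setting follows; it uses the
pooled-sample convention for the ranks, and the shifted logistic data are illustrative.

```python
import numpy as np

def lmp_logistic_stat(x, y):
    # Under a logistic shift alternative, the locally most powerful rank statistic
    # is (up to centering) the Wilcoxon rank sum: 2*sum(R_i)/(N+1) - m,
    # where R_i are the pooled ranks of the first sample and N is the pooled size.
    z = np.concatenate([x, y])
    ranks = z.argsort().argsort() + 1     # ranks 1..N of the pooled data
    r_x = ranks[: len(x)]
    n_pool, m = len(z), len(x)
    return 2.0 * r_x.sum() / (n_pool + 1) - m

rng = np.random.default_rng(0)
x = rng.logistic(loc=0.5, size=30)        # shifted first sample
y = rng.logistic(loc=0.0, size=40)
print(lmp_logistic_stat(x, y))            # large values favor a positive shift
```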
Example 8.2 (Fisher-Yates Test). Suppose that we have a random sample of size m
from the normal distribution with mean μ and variance σ² > 0, whose density is
\[ f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}, \quad -\infty < x, \mu < \infty. \]
Then,
\[ \frac{\partial}{\partial x} \log f(x) = -\frac{x - \mu}{\sigma^2}. \]
Once again, using the density of the ith order statistic, we see that
\[ E\left[\frac{X_{(i)} - \mu}{\sigma}\right] = \frac{m!}{(i-1)!\,(m-i)!} \int z\, [\Phi(z)]^{i-1}\, [1 - \Phi(z)]^{m-i}\, \phi(z)\, dz \]
\[ = \frac{m!}{(i-1)!\,(m-i)!} \int_0^1 \Phi^{-1}(u)\, u^{i-1}(1-u)^{m-i}\, du = E\left[\Phi^{-1}\!\left(V_{(i)}\right)\right], \]
where V_{(i)} is the ith order statistic from a random sample of m uniform random variables
on the interval (0, 1). There is no closed form for this last expectation, though there are
various approximations.² In fact, using the first-order delta method approximation,
\[ E\left[\Phi^{-1}\!\left(V_{(i)}\right)\right] \approx \Phi^{-1}\!\left(E\big[V_{(i)}\big]\right) = \Phi^{-1}\!\left(\frac{i}{m+1}\right). \]
The Fisher-Yates test is therefore based on
\[ \sum_{i=1}^m E\left[\Phi^{-1}\!\left(V_{(R_i)}\right)\right] \approx \sum_{i=1}^m \Phi^{-1}\!\left(\frac{R_i}{m+1}\right). \]
² See also Royston (1982), who provides the approximation
\[ E\big[X_{(i)}\big] \approx \mu + \sigma\,\Phi^{-1}\!\left(\tfrac{i}{m+1}\right)\left[1 + \frac{\tfrac{i}{m+1}\left(1 - \tfrac{i}{m+1}\right)}{2(m+2)\left[\phi\!\left(\Phi^{-1}\!\left(\tfrac{i}{m+1}\right)\right)\right]^2}\right]. \]
where θ_ℓ = (θ_{ℓ1}, …, θ_{ℓk})′ represents the parameter vector for sample ℓ (ℓ = 1, 2) and x_{1j},
x_{2j} are the data from sample 1 and sample 2, with respective sizes m and n − m, that are
associated with the ranking (permutation) ν_j, j = 1, …, n!.
Under the null hypothesis H_0: θ_1 = θ_2, we can assume without loss of generality
that the underlying V1 , . . . , Vn from the combined sample are i.i.d. uniform (by consid-
ering G(Vi ), where G is the common distribution function, assumed to be continuous,
of the Vi ) and that all rankings of the Vi are equally likely. Hence (12.5) represents
an exponential family constructed by exponential tilting of the baseline measure (i.e.,
corresponding to H0 ) on the rank-order data. This has the same spirit as Neyman’s
smooth test of the null hypothesis that the data are i.i.d. uniform against alternatives
in the exponential family (4.1). The Neyman-Pearson lemma can be applied to show
that the score tests have maximum local power at the alternatives that are near θ = 0.
The Neyman parametric embedding in (4.1) makes these results directly applicable to
the rank-order statistics. In particular, this shows that the two-sample Wilcoxon test
of H0 is locally most powerful for testing the uniform distribution against the truncated
exponential distribution for which the data are constrained to lie in the range (0, 1) of
the uniform distribution. Note that these exponential tilting alternatives differ from the
location alternatives in the preceding paragraph not only in their distributional form
(truncated exponential instead of logistic) but also in avoiding the strong assumption of
the preceding paragraph that the data have to be generated from the logistic distribution
even under the null hypothesis (8.7).
Similar results are also valid for tests against scale alternatives. Define the scores
\[ a_{1n}(i, f) = E\,\varphi_1\!\left(V_n^{(i)}, f\right) \approx \varphi_1\!\left(\frac{i}{n+1}, f\right), \]
where
\[ \varphi_1(v, f) = -1 - F^{-1}(v)\, \frac{f'(F^{-1}(v))}{f(F^{-1}(v))}, \quad 0 < v < 1. \]
Theorem 8.2 (Scale Alternatives). Assume ∫_{−∞}^{∞} |x f′(x)| dx < ∞. The test with critical
region
\[ \sum_{i=1}^n c_i\, a_{1n}(R_i, f) \ge k \]
is locally most powerful among rank tests for testing
\[ H_0: f(x_1, \dots, x_n) = \prod_{i=1}^n f_X(x_i) \]
against
\[ H_1: f(x_1, \dots, x_n) = \exp\!\left(-\Delta \sum_{i=1}^n c_i\right) \prod_{i=1}^n f_X\!\left((x_i - \mu)\, e^{-\Delta c_i}\right), \quad \Delta > 0. \]
Example 8.3 (Locally Most Powerful Test for Scale). Suppose that F has an absolutely
continuous density f for which ∫|x f′(x)| dx < ∞, and consider testing the scale hypotheses
of Theorem 8.2. Let
\[ S_m = \sum_{j=1}^m E_F\!\left[-1 - V_{(r_j)}\, \frac{f'(V_{(r_j)})}{f(V_{(r_j)})}\right], \]
where V_{(1)} < ⋯ < V_{(N)} are the order statistics of a random sample of size N from F. Then
the locally most powerful rank test is given by
\[ \phi(q) = \begin{cases} 1 & S_m > k_\alpha \\ \gamma & S_m = k_\alpha \\ 0 & S_m < k_\alpha. \end{cases} \]
In the special case where F is a normal distribution with mean 0 and variance σ², the
locally most powerful rank test is based on the sum
\[ S_m = \sum_{j=1}^m E\big[Z_{(r_j)}^2\big], \]
where Z_{(1)} < ⋯ < Z_{(N)} are the order statistics of a random sample of size N from a standard
normal distribution. On the other hand, if f(x) = e^{−x}, x > 0, then the locally most
powerful rank test is based on the sum of the Savage scores given by
\[ S_m = \sum_{j=1}^m \left[-\log\!\left(1 - \frac{r_j}{N+1}\right) - 1\right]. \]
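A sketch (ours) comparing the logarithmic form of the Savage scores above with the
exact exponential order-statistic means E[Z_{(j)}] = Σ_{i=N−j+1}^{N} 1/i; the choice N = 10 is
illustrative.

```python
import numpy as np

N = 10
j = np.arange(1, N + 1)
# logarithmic approximation to the Savage score a(j) = E[Z_(j)] - 1
approx = -np.log(1.0 - j / (N + 1)) - 1.0
# exact exponential order-statistic means minus 1
exact = np.array([sum(1.0 / np.arange(N - k + 1, N + 1)) for k in j]) - 1.0
print(np.round(approx, 3))
print(np.round(exact, 3))
```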
One may well ask whether the unified theory described in Chapter 6 produces locally
most powerful tests. Indeed, we recall that the locally most powerful tests are a subset of
the class of general linear rank statistics. Consider the two-sample problem for a location
alternative. The locally most powerful test rejects for small values of
\[ \sum_{i=1}^{n_1} E_g\!\left[-\frac{g'(V_{(\mu(i))})}{g(V_{(\mu(i))})}\right], \]
where V_{(1)} < ⋯ < V_{(n)} are the order statistics of a random sample of size n from g. Let
\[ h(j) = E_g\!\left[-\frac{g'(V_{(j)})}{g(V_{(j)})}\right]. \]
Expanding this sum, we see that the test statistic induced by this distance function can
be shown to be
\[ \sum_{i=1}^{n_1} h(\mu(i)) = \sum_{i=1}^{n_1} E_g\!\left[-\frac{g'(V_{(\mu(i))})}{g(V_{(\mu(i))})}\right]. \]
The demonstration is identical to the one using the permutations directly.
Let
\[ x_i = (x_{i1}, \dots, x_{ip})' \]
be the corresponding covariates. Denote the order statistics Y_{(1)} < ⋯ < Y_{(n)} and let
their ranks be denoted by R = (R_1, …, R_n)′. The probability distribution of R is given
by
\[ f(r; \beta) = \int\!\cdots\!\int \prod_{i=1}^n f(u_i - x_i'\beta)\, du_i, \qquad (8.10) \]
where integration is over the set {(u_1, …, u_n) | u_1 < ⋯ < u_n}. Kalbfleisch (1978) arrived
at (8.10) through group considerations. Specifically, it was argued that the regression
model (8.9), conditional on x, is invariant under the group of increasing differentiable
transformations acting on the response. In the parameter space, this leaves β invariant.
We may obtain the locally most powerful test of the hypothesis H_0: β = 0 by computing
the score function U at β = 0:
\[ U = \left.\frac{\partial \log f(r; \beta)}{\partial \beta}\right|_{\beta = 0}. \qquad (8.11) \]
Its lth component is
\[ U_l = \frac{1}{f(r; 0)} \left.\frac{\partial f(r; \beta)}{\partial \beta_l}\right|_{\beta=0}
= n! \int\!\cdots\!\int \prod_{i=1}^n f(u_i) \left[-\sum_{j=1}^n x_{(j)l}\, a(j)\right] du_i
= -\sum_{j=1}^n x_{(j)l}\, a(j) = -\sum_{j=1}^n x_{jl}\, a(R_j), \qquad (8.12) \]
where
\[ a(i) = E\!\left[\frac{\partial \log f(Z_{(i)})}{\partial Z_{(i)}}\right] \qquad (8.13) \]
and Z_{(i)} is the ith order statistic of a sample of size n from f. Hence, when p > 1 we
obtain a vector of linear rank statistics.
If p = 1, then
\[ \sum_{j=1}^n x_j\, a(R_j) \]
is the usual simple linear rank statistic. There is a close connection in that case with
the usual Pearson correlation coefficient, as we saw in Section 3.1, which we recall in the
next example.
Consider the model
\[ Y_i = \alpha + \Delta x_i + \varepsilon_i, \quad i = 1, \dots, n, \]
where {ε_i} are independent identically distributed random variables from a cumulative
distribution function having median 0. We wish to test
\[ H_0: \Delta = \Delta_0 \quad \text{against} \quad H_1: \Delta \ne \Delta_0. \]
Assume x_1 < ⋯ < x_n and let R_i denote the rank of the residual Y_i − Δ_0 x_i among
{Y_j − Δ_0 x_j}. The usual Pearson correlation coefficient is given by
\[ \rho_n = \frac{\sum_{i=1}^n (x_i - \bar{x}_n)\,(a(R_i) - \bar{a}_n)}{\left[\sum_{i=1}^n (x_i - \bar{x}_n)^2 \sum_{i=1}^n (a(R_i) - \bar{a}_n)^2\right]^{1/2}}, \]
where the score function a(·) is increasing with a(1) < a(n) and
\[ \bar{a}_n = \frac{1}{n} \sum_{i=1}^n a(R_i). \]
Since the denominator of ρ_n does not depend on the ordering of the ranks, a test based
on ρ_n is equivalent to one based on the simple linear rank statistic
\[ \sum_{i=1}^n x_i\, a(R_i). \]
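A sketch (ours) of this statistic for the model Y_i = α + Δx_i + ε_i above, using
Wilcoxon-type scores a(i) = i/(n + 1) as an illustrative choice:

```python
import numpy as np

def linear_rank_stat(x, y, delta0):
    # Rank the residuals under Delta_0 and form sum_i x_i a(R_i)
    resid = y - delta0 * x
    ranks = resid.argsort().argsort() + 1        # ranks 1..n of the residuals
    a = ranks / (len(x) + 1.0)                   # Wilcoxon-type scores
    return np.sum(x * a)

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, size=25))
y = 1.0 + 0.8 * x + rng.normal(scale=0.3, size=25)
print(linear_rank_stat(x, y, delta0=0.0))        # compare against its null distribution
```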
then
\[ a(i) = E\big[1 - e^{Z_{(i)}}\big]. \]
We may obtain a general form for the score test in the regression model. The score
function U evaluated at Δ = 0 has variance-covariance matrix
\[ I_0 = E[U U'] = \left(\sum_{i,i'} x_{il}\, x_{i'l'}\, E[a(R_i)\, a(R_{i'})]\right)_{l,l'}. \]
Suppose that \sum_{i=1}^n x_{il} = 0 for each l, so that \sum_{i' \ne i} x_{i'l'} = -x_{il'}. Then
\[ \sum_{i,i'} x_{il}\, x_{i'l'}\, E(a(R_i)\, a(R_{i'})) = \sum_i x_{il}\, x_{il'}\, E[a^2(R_i)] + \sum_{i \ne i'} x_{il}\, x_{i'l'}\, E(a(R_i)\, a(R_{i'})) \]
\[ = \sum_i x_{il}\, x_{il'}\, E[a^2(R_i)] - \sum_i x_{il}\, x_{il'}\, E(a(R_i)\, a(R_{i'})) = \sum_i x_{il}\, x_{il'}\left\{E[a^2(R_i)] - E(a(R_i)\, a(R_{i'}))\right\}. \]
Now note that
\[ E[a^2(R_i)] - E(a(R_i)\, a(R_{i'})) = \frac{1}{n}\sum_{i=1}^n a^2(i) - \frac{1}{n(n-1)} \sum_{i \ne i'} a(i)\, a(i') \]
\[ = \frac{1}{n}\sum_{i=1}^n a^2(i) - \frac{1}{n(n-1)}\left[\left(\sum_{i=1}^n a(i)\right)^2 - \sum_{i=1}^n a^2(i)\right] = \frac{1}{n-1}\sum_{i=1}^n (a(i) - \bar{a})^2, \]
where \bar{a} = \sum_{i=1}^n a(i)/n.
Now set the design matrix X = (x_{ij}). It follows that the (l, l′) entry of X′X is given by
\[ \sum_{i=1}^n x_{il}\, x_{il'}, \]
and hence, using (8.12) and (8.13), the score test takes the form
\[ U'\, I_0^{-1}\, U. \]
where the o_j with 1 ≤ o_1 < o_2 < ⋯ < o_{k_i} ≤ t are the labels of the actual objects being
ranked. Occasionally we may wish to represent this as a t-vector in which missing ranks
are represented by the symbol "—". If k_i = t, the ranking is said to be complete; otherwise
it is incomplete. We wish to test the hypothesis H_0: each judge, when presented with
the specified k_i objects, picks the ranking at random from the space of k_i! permutations
of (1, 2, …, k_i).
Instead of basing our statistic directly on the ranks themselves, we may wish to
replace the ranks assigned by each judge by real-valued score functions
\[ a(j, k_i), \quad 1 \le j \le k_i \le t. \]
In order to motivate the discussion, we begin in the next section with the complete case
where k_i = t.
H_0^*: Each judge picks a complete ranking at random from the space of t!
permutations of (1, …, t).
For a given score function a(j, t), 1 ≤ j ≤ t, and a ranking R, define the vector of adjusted
scores
\[ a(R) = \big(a(R(1), t) - \bar{a}_t,\; a(R(2), t) - \bar{a}_t,\; \dots,\; a(R(t), t) - \bar{a}_t\big)', \quad \text{where } \bar{a}_t = t^{-1}\sum_{r=1}^t a(r, t). \qquad (8.14) \]
Under H_0^*, E(a(R(j), t)) = \bar{a}_t, and the covariance matrix of a(R) is given by
\[ \Sigma_0 = \sigma_0^2\, \frac{t}{t-1}\left(I - \frac{1}{t} J\right), \qquad (8.15) \]
where
\[ \sigma_0^2 \equiv Var(a(R(j), t)) = t^{-1}\sum_{r=1}^t (a(r, t) - \bar{a}_t)^2. \qquad (8.16) \]
For two rankings R_1, R_2, set
\[ A(R_1, R_2) = a(R_1)'\, a(R_2) = \sum_{j=1}^t (a(R_1(j), t) - \bar{a}_t)\,(a(R_2(j), t) - \bar{a}_t). \qquad (8.17) \]
Let the (t × t!) matrix T represent the collection of adjusted score vectors a(R) as R
ranges over all its t! possible values, and let f be the t!-vector of frequencies of the
observed rankings. The (t! × t!) matrix T′T has components A(R_1, R_2) with R_1 and
R_2 ranging over all t! permutations of (1, 2, …, t). With R_i, i = 1, …, n, representing
the observed rankings, set
\[ S_n(j) = \sum_{i=1}^n a(R_i(j), t), \]
and let
\[ S_n = (S_n(1), S_n(2), \dots, S_n(t))', \]
so that T f = S_n − n\bar{a}_t \mathbf{1}, where \mathbf{1} is the t-vector of 1's. Proceeding as in Alvo and
Cabilio (1991), the proposed statistic is the quadratic form
\[ n^{-1} f'(T'T) f = \left(\frac{1}{\sqrt{n}}\, Tf\right)'\!\left(\frac{1}{\sqrt{n}}\, Tf\right) = n^{-1}\sum_{j=1}^t (S_n(j) - n\bar{a}_t)^2, \qquad (8.18) \]
which, standardized by the null variance, becomes
\[ Q_n = \frac{t-1}{n \sum_{r=1}^t (a(r, t) - \bar{a}_t)^2}\, \sum_{j=1}^t (S_n(j) - n\bar{a}_t)^2. \qquad (8.19) \]
For block h of the basic design, let δ_h(j) = 1 if object j is ranked in block h and 0
otherwise, and set
\[ \Omega_h = \{1 \le j \le t \mid \delta_h(j) = 1\}, \qquad \Omega^* = \{\Omega_h \mid 1 \le h \le b\}. \]
Let r_j = \sum_{h=1}^b \delta_h(j) denote the total number of blocks in the basic design that include
object j, and \lambda_{jj'} = \sum_{h=1}^b \delta_h(j)\,\delta_h(j') the number of such blocks which include both
objects 1 ≤ j ≠ j′ ≤ t, so that n r_j judges rank object j, and n\lambda_{jj'} judges rank both
objects j and j′. For an incomplete ranking pattern Ω_h, let
\[ \{R_{h(s)},\; s = 1, 2, \dots, k_h!\} \]
represent the set of possible k_h-rankings, that is, the permutations of (1, 2, …, k_h) within
the specified incomplete pattern. For any such incomplete k_h-ranking with a given
pattern of missing observations, associate a matrix of compatibility C_h whose s =
1, 2, …, k_h! columns are indicators identifying which of the t! complete rankings indexing
its rows are compatible with the particular k_h-permutation R_{h(s)}. The analogue
of the adjusted matrix T, the collection of adjusted score vectors a(R), is given by
\[ T^* = (T_1^* \mid T_2^* \mid \dots \mid T_b^*), \]
where
\[ T_h^* = \frac{k_h!}{t!}\, T\, C_h. \]
Denote by C(R_{h(s)}) the class of complete rankings compatible with the specified k_h-
permutation R_{h(s)} indexed by column s of C_h. Under H_0, the columns of T_h^* are the
conditional expected adjusted scores
\[ E\big[a(R) \mid C(R_{h(s)})\big]. \]
This conditional expectation provides the appropriate weighting for scores in an incomplete
design. As Theorem 8.3 below indicates, this weighted score is given by the
following definition.
Definition 8.1. For a given score function a(j, t), 1 ≤ j ≤ t, if object j is ranked in a
given block 1 ≤ h ≤ b, and if R_h(j) = r, the weighted score is given by
\[ a^*(r, k_h) = \sum_{q=r}^{t-k_h+r} \frac{\binom{q-1}{r-1}\binom{t-q}{k_h-r}}{\binom{t}{k_h}}\, a(q, t), \quad r = 1, 2, \dots, k_h. \qquad (8.20) \]
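A direct transcription (ours) of (8.20); math.comb uses exact integer arithmetic, so the
hypergeometric weights sum to one. With Spearman scores a(q, t) = q, the result should
reproduce a*(r, k) = r(t + 1)/(k + 1), cf. (8.27) below.

```python
from math import comb

def weighted_score(r, k, t, a):
    # a: complete-case score function a(q, t), q = 1..t
    total = 0.0
    for q in range(r, t - k + r + 1):
        w = comb(q - 1, r - 1) * comb(t - q, k - r) / comb(t, k)
        total += w * a(q, t)
    return total

t, k = 8, 4
print([round(weighted_score(r, k, t, lambda q, t: q), 4) for r in range(1, k + 1)])
print([round(r * (t + 1) / (k + 1), 4) for r in range(1, k + 1)])  # should match
```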
compatible with R_{h(s)}. If object j is ranked in pattern Ω_h, and if R_h(j) = r, then there
are exactly
\[ \binom{q-1}{r-1}\binom{t-q}{k_h-r}\,(t - k_h)! \]
complete compatible rankings R for which
\[ R(j) = q, \quad r \le q \le t - k_h + r, \]
so that the average of such scores a(R(j), t) is given by (8.20). If, on the other hand, object
j is not ranked in pattern Ω_h, there are (t − 1)!/k_h! complete rankings R compatible
with R_{h(s)} for which R(j) = q, 1 ≤ q ≤ t, so that the sum of such scores is
\[ \big[(t-1)!/k_h!\big] \sum_{q=1}^t a(q, t). \]
Note that if k_h, the number of objects ranked, is the same for each block, the weighted
scores in (8.20) are all the same function of r. Note further that
\[ \sum_{q=r}^{t-k+r} \binom{q-1}{r-1}\binom{t-q}{k-r}\binom{t}{k}^{-1} = 1, \]
so that
\[ k_h^{-1}\sum_{r=1}^{k_h} a^*(r, k_h) = E(a^*(R_h(j), k_h)) = E(a(R(j), t)) = \bar{a}_t = t^{-1}\sum_{r=1}^t a(r, t), \]
and
\[ \gamma_h^2 = (k_h - 1)^{-1} \sum_{r=1}^{k_h} (a^*(r, k_h) - \bar{a}_t)^2. \]
Proceeding as in Alvo and Cabilio (1999), define the \sum_{h=1}^b k_h!-dimensional vector
\[ f = (f_1 \mid f_2 \mid \cdots \mid f_b)'. \]
Analogous to the complete case statistic, the test statistic in this more general setting is
given by
\[ n^{-1} f'(T^{*\prime} T^*) f = \left(\frac{1}{\sqrt{n}}\, T^* f\right)'\!\left(\frac{1}{\sqrt{n}}\, T^* f\right)
= n^{-1}\sum_{j=1}^t \left[\sum_{i=1}^{nb} (a^*(R_i(j), k_i) - \bar{a}_t)\,\delta_i(j)\right]^2. \qquad (8.23) \]
Set
\[ S_{nb} = (S_{nb}^*(1), S_{nb}^*(2), \dots, S_{nb}^*(t))', \qquad r = (r_1, r_2, \dots, r_t)', \]
where
\[ S_{nb}^*(j) = \sum_{i=1}^{nb} a^*(R_i(j), k_i)\,\delta_i(j). \]
Since
\[ \sum_{i=1}^{nb} \delta_i(j) = n r_j, \]
it follows that
\[ S_{nb} - n\bar{a}_t\, r = \sum_{i=1}^{nb} a^*(R_i), \]
and the statistic may be written as
\[ G_n = n^{-1}(S_{nb} - n\bar{a}_t r)'(S_{nb} - n\bar{a}_t r) = n^{-1}\sum_{j=1}^t (S_{nb}^*(j) - n r_j \bar{a}_t)^2. \qquad (8.24) \]
Under H_0,
\[ \Sigma_0 = \sum_{i=1}^b \gamma_i^2 A_i, \qquad (8.25) \]
and the standardized statistic is
\[ n^{-1}(T^* f)'\, \Sigma_0^-\, (T^* f) = n^{-1}(S_{nb} - n\bar{a}_t r)'\, \Sigma_0^-\, (S_{nb} - n\bar{a}_t r), \qquad (8.26) \]
where \Sigma_0^- is a generalized inverse of \Sigma_0.
For the Spearman score function a(q, t) = q, direct substitution in (8.20) gives
\[ a^*(r, k_i) = \sum_{q=r}^{t-k_i+r} q\, \frac{\binom{q-1}{r-1}\binom{t-q}{k_i-r}}{\binom{t}{k_i}} = r\, \frac{t+1}{k_i+1}. \qquad (8.27) \]
This form is derived in Alvo and Cabilio (1991). Further, \bar{a}_t = (t+1)/2, so that
\[ S_b^*(j) - r_j \bar{a}_t = (t+1)\sum_{i=1}^b (k_i + 1)^{-1}\left(R_i(j) - \frac{k_i+1}{2}\right)\delta_i(j), \qquad (8.28) \]
and
\[ \sum_{j=1}^t \left(S_b^*(j) - r_j\, \frac{t+1}{2}\right)^2 = (t+1)^2 \sum_{j=1}^t \left[\sum_{i=1}^b (k_i+1)^{-1}\left(R_i(j) - \frac{k_i+1}{2}\right)\delta_i(j)\right]^2. \qquad (8.29) \]
In this case, \gamma_i^2 = \frac{1}{12}(t+1)^2\, k_i\,(k_i+1)^{-1}.
score functions a(r, t) is derived. For various distributions of (X_{i1}, X_{i2}, …, X_{it}), identical
over the blocks, Sen (1968b) derives the form of the optimal scores a(r, t) in the
sense that Δ_n(a) = Δ_n^0. In this sense, the Wilcoxon score statistic discussed above is
optimal in the case that the rankings result from complete samples from the logistic
distribution with density f(x) = e^{−x}/(1 + e^{−x})², −∞ < x < ∞. In this case
\[ a^*(r, k_i) = \sum_{q=r}^{t-k_i+r} \frac{\binom{q-1}{r-1}\binom{t-q}{k_i-r}}{\binom{t}{k_i}}\, a(q, t) = K(k_i, t)\, a(r, k_i), \qquad (8.30) \]
where K(k_i, t) depends only on the number of objects ranked in the block. Other score
functions considered by Sen (1968b) also have this property. In particular we have the
following.
(a) With the score a(1, t) = 1, a(t, t) = −1, a(r, t) = 0 otherwise, Q_n is optimal when
sampling from the uniform distribution with density f(x) = 1, 0 ≤ x ≤ 1. Direct
substitution into (8.20) gives a^*(r, k_i) = (k_i/t)\, a(r, k_i).
(b) With the score a(1, t) = 1, a(r, t) = 0 otherwise, Q_n is optimal when sampling
from the exponential distribution with density f(x) = e^{−x}, 0 ≤ x < ∞. Again,
direct substitution into (8.20) gives a^*(r, k_i) = (k_i/t)\, a(r, k_i).
(c) When sampling from the double exponential distribution with density f(x) = ½e^{−|x|},
−∞ < x < ∞, Q_n associated with the score function a(r, t) = 1 − 2\sum_{i=0}^{r-1}\binom{t}{i}2^{-t} is
shown to be optimal. In order to derive the form of a^*(r, k_i) in this case we make
use of the following lemma.
Lemma 8.2. Let F(x; n) = \sum_{i=0}^x \binom{n}{i} p^i (1-p)^{n-i}, 0 ≤ x ≤ n, be the cumulative binomial
distribution function. Then, for all 1 ≤ x ≤ m ≤ n,
\[ F(x-1; m) = \sum_{i=x}^{n-m+x} \binom{i-1}{x-1}\binom{n-i}{m-x}\binom{n}{m}^{-1} F(i-1; n). \qquad (8.31) \]
Proof. See Alvo and Cabilio (2005) for the proof of this lemma.
Suppose that we have random variables X_{i1}, …, X_{it} which may or may not be observable
but which underlie the ranks. We assume that the variables are independently
distributed with continuous cdfs F_{i1}(x), …, F_{it}(x), respectively, for i = 1, …, n. We
would like to test the null hypothesis of no treatment differences against the location
alternatives
\[ F_{ij}(x) = F_i(x - \tau_j), \quad j = 1, \dots, t; \qquad \sum_{j=1}^t \tau_j = 0. \]
From (8.20),
\[ a^*(r, k_i) = \sum_{q=r}^{t-k_i+r} \left[1 - 2\sum_{i=0}^{q-1}\binom{t}{i}2^{-t}\right] \frac{\binom{q-1}{r-1}\binom{t-q}{k_i-r}}{\binom{t}{k_i}}, \]
and by Lemma 8.2 the quantity in square brackets leads to F(r−1; k_i) = \sum_{i=0}^{r-1}\binom{k_i}{i}2^{-k_i},
so that a^*(r, k_i) = a(r, k_i).
The scores considered in this section have certain properties. Since k_i^{-1}\sum_{r=1}^{k_i} a^*(r, k_i) =
\bar{a}_t, it follows that for all such scores, \bar{a}_t = K(k, t)\,\bar{a}_k. Further, if the design is such that
the number of objects ranked in each block is constant, that is, k_i = k for all i, then the
incomplete scores are equivalent to the complete case ones.
For many designs, including balanced incomplete blocks, cyclic, and group divisible
designs, the eigenvalues of A are found analytically. See Alvo and Cabilio (1999) for
details.
The asymptotics and the null hypothesis may be recast in a different setting. Specifically,
consider the situation in which random variables (X_{i1}, X_{i2}, …, X_{it}), which may
or may not be observable, underlie the rankings. These random variables are assumed
independent with absolutely continuous distribution functions F_{ij}(x) = F_i(x − τ_j), where
\sum_j \tau_j = 0, and the F_i(x) have continuous densities f_i(x) for which \int_{-\infty}^{\infty} f_i^2(x)\,dx < ∞.
The null hypothesis of random uniform selection of rankings becomes H_0: τ = 0, where
τ = (τ_1, τ_2, …, τ_t)′. If the asymptotics of interest are simply that the number of blocks
b becomes large, the definitions and notation used earlier may be modified by setting
n = 1 as appropriate. The test statistic may be rewritten as
\[ G_b^* = (S_b - \bar{a}_t r)'\, \Sigma_0^-\, (S_b - \bar{a}_t r), \qquad (8.33) \]
where \Sigma_0, defined in (8.25), is the covariance matrix, under H_0, of
\[ S_b - \bar{a}_t r = (S_b^*(1) - \bar{a}_t r_1,\; S_b^*(2) - \bar{a}_t r_2,\; \dots,\; S_b^*(t) - \bar{a}_t r_t)'. \qquad (8.34) \]
For an object j ranked in pattern Ω_h, let p_{hr}^{(j)} represent the probability that if object j
is being ranked it will be assigned the rank r, 1 ≤ r ≤ k_h. Denote the mean score for this
ranking pattern by
\[ \mu_h(j) = \sum_{r=1}^{k_h} a^*(r, k_h)\, p_{hr}^{(j)}. \qquad (8.35) \]
The variables
\[ U_s = \sum_{j=1}^t c_j \sum_{i=(s-1)b+1}^{sb} \big[a^*(R_i(j), k_i)\,\delta_i(j) - \mu_i(j)\big], \quad s = 1, 2, \dots, n, \qquad (8.36) \]
are independent with zero means and with n^{-1}\sum_{s=1}^n E|U_s|^3 bounded. Thus, by the
Lindeberg-Feller central limit theorem (Chapter 2.1), it follows that as n → ∞,
n^{-1/2}\sum_{s=1}^n U_s is asymptotically normal for every c = (c_1, c_2, …, c_t)′ ≠ (1, 1, …, 1)′. Thus
we have that n^{-1/2}(S_{nb} − E(S_{nb})) has an asymptotic multivariate normal distribution.
Note that the expected values of the elements of S_{nb} are
\[ E(S_{nb}^*(j)) = n\,\mu_b(j) = n\sum_{h=1}^b \mu_h(j)\,\delta_h(j). \qquad (8.37) \]
Specializing the situation described in the previous section, assume that the absolutely
continuous distributions of the (X_{i1}, X_{i2}, …, X_{it}) are of the form F_{ij}(x) =
F(x − τ_j), with continuous density f(x) for which \int_{-\infty}^{\infty} f^2(x)\,dx < ∞. Consider the
alternatives
\[ H_{1n}: \tau = \tau_n = n^{-1/2}\theta, \quad \theta = (\theta_1, \theta_2, \dots, \theta_t)'. \]
In order to investigate the asymptotic distribution of the test statistic under such local
translation alternatives, we use arguments and notation similar to those in Sen (1968b).
Thus let
\[ \beta_s^{(h)} = \binom{k_h - 2}{s} \int_{-\infty}^{\infty} F(x)^s\,(1 - F(x))^{k_h - 2 - s}\, f^2(x)\, dx, \quad s = 0, 1, \dots, k_h - 2, \]
where \beta_{-1}^{(h)} = \beta_{k_h - 1}^{(h)} = 0. By definition, for all j ∈ Ω_h,
\[ p_{hr}^{(j)} = \sum_{S_j} \int_{-\infty}^{\infty} P\big(X_{s_1} \le x, \dots, X_{s_{r-1}} \le x,\; X_{s_{r+1}} > x, \dots, X_{s_{k_h}} > x \mid X_j = x\big)\, dF_j(x), \qquad (8.38) \]
where the summation extends over all possible choices of (s_1, …, s_{r-1}) from Ω_h \ {j},
with (s_{r+1}, …, s_{k_h}) the complementary set. Use of the distributional assumptions, independence,
a Taylor series expansion, and some manipulation shows that the probability
in (8.38) can be written as
\[ p_{hr}^{(j)} = k_h^{-1} + n^{-1/2}\, k_h\,(\theta_j - \bar{\theta}_h)\left(\beta_{r-2}^{(h)} - \beta_{r-1}^{(h)}\right) + o(n^{-1/2}), \]
where \bar{\theta}_h = k_h^{-1}\sum_{j \in \Omega_h} \theta_j. Hence we see that \mu_b(j)
defined in (8.35) is related to the null mean through
\[ \mu_b(j) - r_j \bar{a}_t = \eta(j, a^*) + o(n^{-1/2}), \qquad (8.39) \]
where
\[ \eta(j, a^*) = n^{-1/2}\sum_{h=1}^b k_h\,(\theta_j - \bar{\theta}_h)\,\delta_h(j)\sum_{r=1}^{k_h} a^*(r, k_h)\left(\beta_{r-2}^{(h)} - \beta_{r-1}^{(h)}\right). \qquad (8.40) \]
A similar expansion for the joint probabilities gives
\[ p_{h,rs}^{(j,j')} = \frac{1}{k_h(k_h - 1)} + o(n^{-1/2}), \qquad (8.41) \]
so that the covariance matrix of the summands can be written with entries
\[ \left[\gamma_h^2\, \frac{k_h - 1}{k_h} + o(n^{-1/2})\right]\delta_h(j) \]
on the diagonal and
\[ \left[-\gamma_h^2\, \frac{1}{k_h} + o(n^{-1/2})\right]\delta_h(j)\,\delta_h(j') \]
off the diagonal. It follows that
\[ \mathrm{Cov}\left(n^{-1/2} S_{nb}\right) - \Sigma_0 \to 0 \quad \text{as } n \to \infty, \]
where \Sigma_0 is defined in (8.25). Combining this result with the asymptotic normality of
n^{-1/2} S_{nb}, it follows that if the basic design is connected, the statistic G_b^* has a limiting
noncentral chi-square distribution with noncentrality parameter
\[ [\eta(1, a^*), \dots, \eta(t, a^*)]\; \Sigma_0^-\; [\eta(1, a^*), \dots, \eta(t, a^*)]', \]
where Θ is the vector with components \sum_{h=1}^b (\theta_j - \bar{\theta}_h)\,\delta_h(j), j = 1, 2, …, t, and A is
defined in (8.32). This parameter may be written as a multiple of a correlation coefficient,
which is maximized when a^*(r, k) − \bar{a}_t = c\,(\beta_{r-2} − \beta_{r-1}). This relationship leads to the
same conclusions as in Sen (1968b), namely that the score functions detailed earlier in
this section are optimal for the stated underlying distributions.
8.4. Exercises
Exercise 8.1. Let X have an exponential distribution with mean 1. Find the optimal
score statistic for the scale alternative.
Exercise 8.2. Let X have a Cauchy distribution. Find the optimal score statistic for
the location alternative.
9. Efficiency
In this chapter, we consider the asymptotic efficiency of tests which requires knowledge
of the distribution of the test statistics under both the null and the alternative hypothe-
ses. In the usual cases such as for the sign and the Wilcoxon tests, the calculations
are straightforward. However, the calculations become more complicated in the multi-
sample situations and for these, we appeal to Le Cam’s lemmas. This is illustrated in
the case of test statistics involving both the Spearman and Hamming distances. The
smooth embedding approach is useful in that for a given problem, it leads to test statis-
tics whose power function can be determined. The latter is then assessed against the
power function of the optimal test statistic derived from Hoeffding’s formula for any
given underlying distribution of the data.
\[ \alpha_n = \sup_{\theta \in \Omega_0} \beta_n(T_n; \theta), \]
where it is convenient to let \mu(\theta_n) denote the mean of T_n and \sigma^2(\theta_n)/n the variance
of T_n;
(iii) the test rejects for large values of the test statistic, that is,
\[ \frac{\sqrt{n}\,(T_n - \mu(0))}{\sigma(0)} > z_\alpha, \]
where
\[ c = \lim_{n \to \infty} \frac{\mu(\theta_n) - \mu(0)}{\sigma(\theta_n)\,\theta_n} = \frac{\mu'(0)}{\sigma(0)}. \]
The quantity c is called the slope of the test. The larger the slope, the more powerful
the test. A comparison of two tests T_{n,1}, T_{n,2} may then be made in terms of the ratio of
their slopes.
Example 9.1 (One-Sample Tests: Sign Test, t-Test, Wilcoxon Signed Rank Test).
We first calculate the slope of the sign test. Suppose that we have a random sample
of size n from some distribution having absolutely continuous cumulative distribution
function F(x − θ) which is symmetric around its median θ. We would like to test the
null hypothesis H_0: θ = 0 against the one-sided alternative H_1: θ > 0. Let T_{n,1} be the
proportion of X_i > 0. The sign test rejects the null hypothesis for large values of T_{n,1}. Set
\[ p = P(X > 0) = 1 - F(-\theta). \]
Then nT_{n,1} has a binomial distribution with parameters n and probability of success p.
Hence,
\[ \mu(\theta) = E_\theta[T_{n,1}] = p, \qquad Var_\theta[T_{n,1}] = \frac{p(1-p)}{n}. \]
It follows that σ²(θ) = p(1 − p), and the slope is
\[ \frac{\mu'(0)}{\sigma(0)} = 2f(0). \qquad (9.2) \]
For the t-test, based on T_{n,2} = \bar{X}_n,
\[ \mu(\theta) = E[T_{n,2}] = \int x\, dF(x - \theta) = \theta. \]
It follows that
\[ \frac{\mu'(0)}{\sigma(0)} = \frac{1}{\sigma}. \qquad (9.3) \]
Hence, when f(x) is the standard normal density, the asymptotic relative efficiency of
the sign test to the t-test is
\[ (2\sigma f(0))^2 = \left(\frac{2}{\sqrt{2\pi}}\right)^2 = \frac{2}{\pi} \approx 0.637. \]
The Wilcoxon signed rank statistic is
\[ T_{n,3} = \sum_{i=1}^n R_i^+\, I(X_i > 0), \]
where, as in Section 5.2, R_i^+ is the rank of |X_i| among {|X_i|, i = 1, …, n}. It follows
that under the null hypothesis,
\[ E_0[T_{n,3}] = \frac{n(n+1)}{4}, \qquad Var_0[T_{n,3}] = \frac{n(n+1)(2n+1)}{24}. \]
The calculations of the mean and variance under the alternative for the Wilcoxon signed
rank statistic are more involved and are given by the following theorem. Define
\[ p_1 = P(X > 0), \qquad p_2 = P(X_1 + X_2 > 0), \]
\[ p_3 = P(X_1 + X_2 > 0,\, X_1 > 0), \qquad p_4 = P(X_1 + X_2 > 0,\, X_1 + X_3 > 0). \]
Theorem 9.1. We have
\[ E[T_{n,3}] = n p_1 + \frac{n(n-1)}{2}\, p_2 \]
and
\[ Var[T_{n,3}] = n p_1(1 - p_1) + \frac{n(n-1)}{2}\, p_2(1 - p_2) + 2n(n-1)(p_3 - p_1 p_2) + n(n-1)(n-2)\left(p_4 - p_2^2\right). \]
Proof. See page 47 of Hettmansperger (1994). Here p_3 = (p_2 + p_1^2)/2.
For θ near 0,
\[ p_1 = 1 - F(-\theta) \approx 1 - [F(0) - \theta f(0)] = \frac{1}{2} + \theta f(0). \]
Also,
\[ p_2 = P(X_1 + X_2 > 0) \approx 1 - F^*(-2\theta) = \frac{1}{2} + 2\theta f^*(0), \]
where F^* is the convolution distribution and f^* is its density. Using the symmetry of f,
whereby f(x) = f(−x), we have that
\[ f^*(0) = \int_{-\infty}^{\infty} f^2(x)\, dx. \]
In order to compute the slope for the Wilcoxon signed rank test, we note that
\[ \mu(\theta) = E[T_{n,3}] = n p_1 + \frac{n(n-1)}{2}\, p_2, \]
so that
\[ \mu'(0) = n f(0) + n(n-1) \int_{-\infty}^{\infty} f^2(x)\, dx \]
and
\[ \frac{\mu'(0)}{\sigma(0)} = \sqrt{12} \int_{-\infty}^{\infty} f^2(x)\, dx. \qquad (9.4) \]
The calculations of the slopes in (9.2), (9.3), and (9.4) enable us to compute the
asymptotic relative efficiencies.
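A short numerical sketch (ours) of these slope-based efficiencies for a standard normal
density: ARE(sign, t) = (2σf(0))² and ARE(Wilcoxon, t) = 12σ²(∫f²)².

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

f0 = norm.pdf(0.0)                               # f(0) = 1/sqrt(2*pi)
int_f2, _ = integrate.quad(lambda x: norm.pdf(x) ** 2, -np.inf, np.inf)
print((2 * f0) ** 2)                             # 2/pi ~ 0.637 (sign vs t)
print(12 * int_f2 ** 2)                          # 3/pi ~ 0.955 (Wilcoxon vs t)
```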
Example 9.2 (Two-Sample Wilcoxon Test). Consider the Mann-Whitney form of the
two-sample Wilcoxon statistic
\[ T_{m,n} = \frac{1}{mn} \sum_{j=1}^n \sum_{i=1}^m I(X_i < Y_j), \]
for which
\[ \mu(\Delta) = E[T_{m,n}] = P(X < Y) = \int_{-\infty}^{\infty} F(y)\, f(y - \Delta)\, dy = \int_{-\infty}^{\infty} F(y + \Delta)\, dF(y). \]
In terms of the probabilities q_1 = P(X < Y), q_2 = P(X_1 < Y_1, X_1 < Y_2), and
q_3 = P(X_1 < Y_1, X_2 < Y_1),
\[ Var[T_{m,n}] = \frac{1}{(mn)^2}\Big[ mn\, q_1(1-q_1) + mn(n-1)(q_2 - q_1^2) + mn(m-1)(q_3 - q_1^2)\Big]
\approx m^{-1}(q_2 - q_1^2) + n^{-1}(q_3 - q_1^2), \]
so that for m/(m + n) → λ > 0, using arguments similar to those following Theorem 9.1,
it follows that the slope is given by
\[ \frac{\mu'(0)}{\sigma(0)} = \sqrt{12\lambda(1-\lambda)} \int_{-\infty}^{\infty} f^2(x)\, dx. \qquad (9.5) \]
By way of comparison, we may also compute the slope for the two-sample t-test, which
rejects the null hypothesis for large values of the ratio
\[ \frac{\bar{Y}_m - \bar{X}_n}{S_{n,m}\sqrt{m^{-1} + n^{-1}}}, \]
where \bar{X}_n, \bar{Y}_m represent the sample means and S_{n,m}^2 represents the pooled sample variance
estimate of the common variance σ². It is straightforward to show that the slope of the
two-sample t-test is
\[ \frac{\mu'(0)}{\sigma(0)} = \frac{\sqrt{\lambda(1-\lambda)}}{\sigma}. \qquad (9.6) \]
We record in Table 9.1 the relative efficiency of the Wilcoxon test relative to the
t-test for various underlying distributions.
Table 9.1.: Relative efficiency of the Wilcoxon two-sample test relative to the t-test
Distribution   Relative efficiency
Normal         3/π
Logistic       π²/9
Uniform        1
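A sketch (ours) reproducing the Table 9.1 values via ARE = 12σ²(∫f²)² for each
distribution; the uniform is centered on (−1/2, 1/2) for convenience.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm, logistic, uniform

for name, dist in [("normal", norm), ("logistic", logistic),
                   ("uniform", uniform(loc=-0.5, scale=1.0))]:
    lo, hi = dist.ppf(1e-12), dist.ppf(1 - 1e-12)
    i2, _ = integrate.quad(lambda x: dist.pdf(x) ** 2, lo, hi)
    print(name, 12 * dist.var() * i2 ** 2)
# normal: 3/pi ~ 0.955, logistic: pi^2/9 ~ 1.097, uniform: 1.0
```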
Example 9.3 (Median Test). Consider the two-sample test in Example 9.2 with m
observations in the first sample and n in the second. Set N = m + n. The median test
rejects the null hypothesis that the medians of the two samples are the same for large
values of
\[ T_N = \frac{1}{N} \sum_{i=m+1}^N I\!\left(R_i \le \frac{N+1}{2}\right), \]
where R_1, …, R_N are the ranks of the combined data and I(A) is the indicator function
of the set A. Under the null hypothesis,
\[ \sqrt{N}\left(T_N - \frac{n}{2N}\right) = \frac{1}{N\sqrt{N}}\left[-n\sum_{i=1}^m I\!\left(F(X_i) \le \frac{1}{2}\right) + m\sum_{j=1}^n I\!\left(F(Y_j) \le \frac{1}{2}\right)\right] + o_P(1) \qquad (9.7) \]
\[ \xrightarrow{L} N\!\left(0, \frac{\lambda(1-\lambda)}{4}\right), \qquad (9.8) \]
since the left-hand side of (9.7) can be expressed as a linear rank statistic. The right-hand
side of (9.7) is the projection (see Exercise 3.8). On the other hand, for alternatives
θ_N = h/√N, the log likelihood ratio is given by
\[ \log \frac{\prod f(X_i)\, \prod g(Y_j - \theta_N)}{\prod f(X_i)\, \prod g(Y_j)}
= -\frac{h\sqrt{1-\lambda}}{\sqrt{n}} \sum_{j=1}^n \frac{g'}{g}(Y_j) - \frac{h^2(1-\lambda)\, I_g}{2} + o_P(1) \qquad (9.9) \]
and hence is asymptotically normal. Moreover, the joint distribution of the linear parts
of the right-hand sides of (9.7) and (9.9) is multivariate normal. Slutsky's lemma implies
a similar result for the right-hand sides. Hence, by Le Cam's third lemma, under
alternatives θ_N = h/√N,
\[ \sqrt{N}\left(T_N - \frac{n}{2N}\right) \xrightarrow{L} N\!\left(\tau(h), \frac{\lambda(1-\lambda)}{4}\right), \]
where
\[ \tau(h) = -h\lambda(1-\lambda)\int_{F(y) \le 1/2} \frac{f'(y)}{f(y)}\, dF(y) \]
is the asymptotic mean. The slope of the median test is then given by
\[ -2\sqrt{\lambda(1-\lambda)} \int_0^{1/2} \frac{f'(F^{-1}(v))}{f(F^{-1}(v))}\, dv. \qquad (9.10) \]
Consider next the multi-sample ordered location problem. Assume the Fisher information
\[ I(f) = \int_{-\infty}^{\infty} \left(\frac{f'(x)}{f(x)}\right)^2 f(x)\, dx < \infty \qquad (9.11) \]
and recall the score function
\[ \varphi(v, f) = -\frac{f'(F^{-1}(v))}{f(F^{-1}(v))}, \quad 0 < v < 1. \]
The alternatives of interest are
\[ H_1: F_k(x) = F(x - \Delta_k), \quad 1 \le k \le r, \]
where
\[ \Delta_1 < \dots < \Delta_r, \qquad \bar{d} = \frac{1}{N_r}\sum_{i=1}^{N_r} d_i, \]
and
\[ d_i = \Delta_k, \quad N_{k-1} < i \le N_k. \qquad (9.12) \]
We shall also assume that as min{n_1, …, n_r} → ∞, the following regularity conditions
hold under the alternative:
\[ N_r^{1/2}\, \Delta_k \to \delta_k, \qquad (9.13) \]
\[ \max_{1 \le i \le N_r} \left(d_i - \bar{d}\right)^2 \to 0, \qquad (9.14) \]
\[ I(f)\sum_{i=1}^{N_r} \left(d_i - \bar{d}\right)^2 \to b^2, \text{ a finite constant}. \qquad (9.15) \]
We may determine the locally most powerful test for this problem by using Hoeffding's
Lemma 8.1. In fact, provided
\[ \int_{-\infty}^{\infty} |f'(x)|\, dx < \infty, \]
the test is based on
\[ T_0 = -\sum_{i=1}^{N_r} (d_i - \bar{d})\, \varphi\big(X_i - \bar{d}, f\big) \qquad (9.16) \]
\[ = -\sum_{i=1}^{N_r} (d_i - \bar{d})\, \frac{f'(X_i - \bar{d})}{f(X_i - \bar{d})}, \qquad (9.17) \]
which under the null hypothesis has an asymptotic normal distribution with mean 0 and
variance
\[ b^2 = I(f)\sum_{i=1}^{N_r} (d_i - \bar{d})^2. \qquad (9.18) \]
Let V_1, …, V_n be i.i.d. uniform on (0, 1), let π denote the rank vector, and put
\[ a_n^\varphi(i) = E\{\varphi(V_1) \mid \pi(1) = i\}, \quad 1 \le i \le n. \]
Then
\[ \lim_{n \to \infty} E\{a_n^\varphi(\pi(1)) - \varphi(V_1)\}^2 = 0. \]
Consider now the linear rank statistic
\[ S_n = \sum_{i=1}^n (c_i - \bar{c})\, a_n(\mu(i)), \]
where the {c_i} satisfy Noether's condition. Then, under the regression alternatives,
\[ \frac{S_n - m_n}{\sigma_n} \xrightarrow{L} N(0, 1), \]
where
\[ m_n = \sum_{i=1}^n (c_i - \bar{c})(d_i - \bar{d}) \int_0^1 \varphi(v)\, \varphi(v, f)\, dv, \]
\[ \sigma_n^2 = \sum_{i=1}^n (c_i - \bar{c})^2 \int_0^1 (\varphi(v) - \bar{\varphi})^2\, dv. \]
Corollary 9.1. Consider the Spearman statistic for the multi-sample ordered location
problem,
\[ S_r = \sum_{i=1}^{N_r} c(i)\, \frac{\mu(i)}{N_r + 1}, \]
where, for 1 ≤ k ≤ r,
\[ c(i) = N_{k-1} + N_k, \quad N_{k-1} < i \le N_k. \]
Then, as min{n_1, …, n_r} → ∞, S_r is, under the alternative, asymptotically normal with
mean and variance given respectively by
\[ m_r = \frac{N_r^2}{2} + \sum_{i=1}^{N_r} (c(i) - \bar{c})(d_i - \bar{d}) \int_0^1 v\, \varphi(v, f)\, dv, \]
\[ \sigma_r^2 = \frac{N_r^3}{12}\sum_{k=1}^r w_k W_k W_{k-1}, \]
with
\[ \frac{n_k}{N_r} \to w_k, \qquad W_k = \sum_{i=1}^k w_i. \]
Moreover,
\[ \sum_{i=1}^{N_r} (c(i) - \bar{c})(d_i - \bar{d}) = N_r^{3/2}\sum_{k=1}^r w_k \delta_k (W_k + W_{k-1} - 1) \]
and
\[ \sum_{i=1}^{N_r} \left(d_i - \bar{d}\right)^2 = \sum_{k=1}^r w_k \delta_k^2 - \left(\sum_{k=1}^r w_k \delta_k\right)^2. \]
The asymptotic power of the Spearman test is therefore
\[ \lim_{\min\{n_i\} \to \infty} P\!\left(\frac{S_r - E_0 S_r}{\sqrt{Var_0\, S_r}} > k_\alpha\right)
= 1 - \Phi\!\left(k_\alpha - \frac{\sum_{k=1}^r w_k \delta_k (W_{k-1} + W_k - 1) \int_0^1 v\,\varphi(v, f)\, dv}{\sqrt{\frac{1}{12}\sum_{k=1}^r w_k W_{k-1} W_k}}\right). \qquad (9.20) \]
On the other hand, the asymptotic power of the locally most powerful test defined
by the optimal score function is governed by the noncentrality
\[ I(f)\left\{\sum_{k=1}^r w_k \delta_k^2 - \left(\sum_{k=1}^r w_k \delta_k\right)^2\right\}. \]
In Table 9.2, we record various integrals which enable us to compute the efficiencies.
To illustrate the calculation, note that for the standard normal, using a change of variable
and integration by parts, ∫_{1/2}^{1} u ϕ(u) du = 1/(2√(2π)) + 1/(4√π) ≈ 0.3405.
Table 9.2.: Integrals of the score function ϕ(u, f) for three underlying distributions
Integral              Normal                                                        Double exponential   Logistic
∫_{1/2}^{1} u ϕ(u) du    1/(2√(2π)) + 1/(4√π) ≈ 0.3405                                 3/8                  5/24
∫_{0}^{1/2} u ϕ(u) du    −1/(2√(2π)) + 1/(4√π) ≈ −0.0585                                −1/8                 −1/24
∫_{1/2}^{1} u² ϕ(u) du   1/(4√(2π)) + (arctan(1/√2) + π/2)/(2π^{3/2}) ≈ 0.2961          7/24                 17/96
∫_{0}^{1/2} u² ϕ(u) du   −1/(4√(2π)) − (arctan(1/√2) + π/2)/(2π^{3/2}) + 1/(2√π) ≈ −0.0141  −1/24            −1/96
Here the densities are:
Normal: f(x) = (1/(√(2π)σ)) e^{−x²/(2σ²)}, −∞ < x < ∞;
Double exponential: f(x) = ½ e^{−|x|}, −∞ < x < ∞;
Logistic: f(x) = e^{−x}(1 + e^{−x})^{−2}, −∞ < x < ∞.
Define
\[ T_1 = \sum_{i=1}^{N_r} (d_i - \bar{d})\, \varphi(V_i, f), \qquad
T_2 = \sum_{i=1}^{N_r} (d_i - \bar{d})\, \varphi\!\left(\frac{\mu(i)}{N_r + 1}, f\right). \]
Theorem 9.5. Let
\[ \log L_n = \sum_{i=1}^n \log \frac{f(x_i - d_i)}{f(x_i - \bar{d})}. \]
Then
\[ \log L_n - T_1 + \frac{b^2}{2} \xrightarrow{P} 0 \]
as min{n_1, …, n_r} → ∞. Moreover, log L_n is asymptotically N(−b²/2, b²).
Proof. The asymptotic normality of T_1 follows from the Lindeberg condition. See Theorem
VI.2.1 of Hájek and Sidak (1967) for details of the proof.
The next theorem provides the general result which is useful for obtaining the non-
null distribution of test statistics involving more general distance functions such as that
of Hamming.
Theorem 9.6. Let
\[ T = \sum_{i=1}^{N_r} a_{i\pi(i)}, \]
and assume as well that I(f) < ∞ holds. Then, under the alternative H_1, the asymptotic
distribution of T is N(σ_{12}, 1), where
\[ \sigma_{12} = E_0[T\, T_2] \]
and
\[ T_2 = \sum_{i=1}^{N_r} (d_i - \bar{d})\, \varphi\!\left(\frac{\pi(i)}{N_r + 1}, f\right). \]
Proof. The proof follows closely the proof of Theorem VI.2.4 of Hájek and Sidak (1967).
Recall
\[ T_1 = \sum_{i=1}^{N_r} (d_i - \bar{d})\, \varphi(V_i, f). \]
Under H_0, from Theorem 9.4, T_1 and T_2 are asymptotically equivalent. Under H_0,
Theorem 9.5 then implies
\[ \log L_{N_r} \sim T_1 - \frac{b^2}{2}. \]
It follows that under H_0,
\[ (T, \log L_{N_r}) \sim \left(T,\, T_1 - \frac{b^2}{2}\right) \sim \left(T,\, T_2 - \frac{b^2}{2}\right). \]
If we show that (T, T_2) is jointly asymptotically normal under H_0, then from Le Cam's
third lemma it will follow that under the alternative,
\[ T \sim N(\sigma_{12}, 1). \]
In view of the Cramér-Wold device (Section 2.1.2), it remains to show that for
arbitrary constants c_1, c_2,
\[ c_1 T + c_2 T_2 \sim \text{Normal}. \]
For that purpose we make use of Hoeffding's combinatorial central limit theorem (see
Section 3.3). Let
\[ a^*_{ij} = c_1 a_{ij} + c_2 (d_i - \bar{d})\, \varphi\!\left(\frac{j}{N_r + 1}, f\right). \]
Then
\[ c_1 T + c_2 T_2 = \sum_{i=1}^{N_r} a^*_{i\pi(i)}. \]
Put d(i, j) and d^*(i, j) for the doubly centered versions of a_{ij} and a^*_{ij}, respectively.
Since ϕ(v, f) is an integrable function, there exists a constant M > 0 such that
|ϕ(v, f)| ≤ M a.s. Also,
\[ |d_i - \bar{d}| \le \max_{1 \le k \le r} \left|\Delta_k - \bar{d}\right| \approx N_r^{-1/2} \max_{1 \le k \le r} \left|\delta_k - \sum_k w_k \delta_k\right| = O\!\left(N_r^{-1/2}\right). \]
Hence
\[ |d^*(i, j)| \le |c_1 d(i, j)| + |c_2|\, |d_i - \bar{d}|\, \left|\varphi\!\left(\frac{j}{N_r + 1}, f\right) - \bar{\varphi}(\cdot, f)\right|
\le |c_1| \max_{1 \le i, j \le N_r} |d(i, j)| + |c_2|\, M\, |d_i - \bar{d}| = O(N_r^p) + O\!\left(N_r^{-1/2}\right) = o(1). \]
Moreover,
\[ Var_0\{c_1 T + c_2 T_2\} = \frac{1}{N_r - 1}\sum_{i,j} (d^*(i, j))^2 \to c^2 \]
and
\[ \frac{\max_{1 \le i, j \le N_r} (d^*(i, j))^2}{\frac{1}{N_r - 1}\sum_{i,j} (d^*(i, j))^2} \to 0. \]
This completes the proof.
Applying Theorem 9.6 with p = −1/2, we see that Hamming's statistic H_r is, under
the alternative, asymptotically normal with mean
\[ m_H = 1 + E_0[H_r T_2], \]
where
\[ E_0[H_r T_2] = \frac{1}{N_r - 1}\sum_{k=1}^r \sum_{j=N_{k-1}+1}^{N_k} \left(\Delta_k - \bar{d}\right)\left[\varphi\!\left(\frac{j}{N_r + 1}, f\right) - \bar{\varphi}(\cdot, f)\right] \]
\[ \approx N_r^{-1/2}\sum_{k=1}^r \left(\delta_k - \sum_{j=1}^r w_j \delta_j\right) \int_{W_{k-1}}^{W_k} \left[\varphi(v, f) - \bar{\varphi}(f)\right] dv
= N_r^{-1/2}\sum_{k=1}^r \delta_k \int_{W_{k-1}}^{W_k} \left[\varphi(v, f) - \bar{\varphi}(f)\right] dv. \]
We may specialize these results to the two-sample case. In particular, for r = 2 and
sample sizes n_1, n_2 with n_1/(n_1 + n_2) → λ, we may calculate
\[ m_H = 1 + \frac{b\, I^{-1/2}(f)}{\sqrt{n_1 + n_2}}\left[\sqrt{\frac{\lambda}{1-\lambda}} \int_\lambda^1 \varphi(v, f)\, dv - \sqrt{\frac{1-\lambda}{\lambda}} \int_0^\lambda \varphi(v, f)\, dv\right], \]
and
\[ \sigma_H^2 = \frac{1}{n_1 + n_2 - 1}. \]
It now follows that the asymptotic relative efficiency for the Hamming distance is
given by
\[ ARE = \frac{\left(\sum_{k=1}^r \delta_k \int_{W_{k-1}}^{W_k} \left[\varphi(v, f) - \bar{\varphi}(f)\right] dv\right)^2}{(r - 1)\, I(f)\, \sum_{k=1}^r w_k \left(\delta_k - \sum_{k=1}^r w_k \delta_k\right)^2}. \qquad (9.21) \]
On the other hand, suppose that the maximum asymptotic power can be reached by the
test based on χ²_{r−1}(b²). Hence the asymptotic relative efficiency can be defined to be
approximately
\[ \frac{P\!\left(\chi_k^2(\delta^2) \ge \chi_{\alpha,n}^2\right) - \alpha}{P\!\left(\chi_k^2(b^2) \ge \chi_{\alpha,n}^2\right) - \alpha}
\approx 2^{(r-1-n)/2}\, \exp\!\left(\frac{\chi_{\alpha,r-1}^2 - \chi_{\alpha,n}^2}{2}\right)
\frac{\Gamma\!\left(\frac{r+2}{2}\right)}{\Gamma\!\left(\frac{n+2}{2}\right)}
\frac{\left(\chi_{\alpha,n}^2\right)^{n/2}}{\left(\chi_{\alpha,r-1}^2\right)^{(r-1)/2}} \frac{\delta^2}{b^2}. \qquad (9.22) \]
In the case where r = 2,
\[ b^2 = I(f)\, w_1 w_2 (\delta_1 - \delta_2)^2. \]
Theorem 9.7. The Spearman statistic for the multi-sample unordered problem is asymptotically
chi-square χ²_{r−1}(δ_S²) with noncentrality parameter
\[ \delta_S^2 = 12\, b^2 \left(\int_0^1 v\, \varphi(v, f)\, dv\right)^2 \big/\, I(f). \]
Theorem 9.8. The Hamming statistic for the multi-sample unordered problem is asymptotically
chi-square χ²_{(r−1)²}(δ_H²) with noncentrality parameter
\[ \delta_H^2 = E[\bar{\mu}]'\, \Sigma_H^{-1}\, E[\bar{\mu}]. \]
Theorems 9.7 and 9.8 permit us to calculate the asymptotic relative efficiencies in
the unordered case. It can be shown that these are the same as for the ordered situation,
and hence we have the same results as above.
9.4. Exercises
Exercise 9.1. Show that for the two-sample problem the asymptotic relative efficiency
for the normal density,
\[ f(x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\!\left(-\frac{x^2}{2\sigma^2}\right), \quad -\infty < x < \infty, \quad I(f) = \sigma^{-2}, \]
is ARE = 3/π ≈ 0.955.
Exercise 9.2. Show that for the two-sample problem the asymptotic relative efficiency
for the double exponential,
\[ f(x) = \frac{1}{2}\exp(-|x|), \quad -\infty < x < \infty, \quad I(f) = 1, \]
is ARE = 3/4 = 0.75.
Exercise 9.3. Show that for the two-sample problem the asymptotic relative efficiency
for the logistic distribution,
\[ f(x) = e^{-x}\left(1 + e^{-x}\right)^{-2}, \quad -\infty < x < \infty, \quad I(f) = 1/3, \]
is ARE = 1.
Part III.
Selected Applications
10. Multiple Change-Point Problems
10.1. Introduction
In the classical formulation of the single change-point problem, there is a sequence
X1 , . . . , Xn of independent continuous random variables such that the Xi for i ≤ τ have
a common distribution function F1 (x) and those for i > τ a common distribution F2 (x).
It is of interest to test the hypothesis of “no change,” i.e., τ = n against the alternative
of a change, 1 ≤ τ < n.
We begin by formulating the problem in terms of a parametric framework and study
the properties of the new model. We then construct a composite likelihood function
which permits us to conduct tests of hypotheses based on a score statistic to assess
the significance of the change-points. We demonstrate the consistency of the estimated
change-point locations and present a binary segmentation algorithm to search for the
multiple change-points. We then report on a number of simulation experiments in order
to compare the performance of the proposed method with other methods in the literature.
We apply the new method to detect the DNA copy number alterations in a human
genomic data set and to identify the change-points on an interest rate time series. Our
empirical results reveal that the proposed method is efficient for change-point detection
even when the data are serially correlated.
where f_0(t) is the density of T under the null hypothesis and K(θ) is a normalizing
constant. This is an example of exponential tilting, where the first factor in (10.1)
represents the alternative to the null hypothesis. Consider the specific case where the
kernel is given by
\[ h(x, y) = \mathrm{sgn}(x - y). \qquad (10.2) \]
When θ = 0, there is no change-point and sgn(x − y) = ±1 with equal probability 1/2,
irrespective of the underlying common distribution of X and Y. Hence
\[ f_0(t) = \frac{1}{2}, \quad t = \pm 1, \]
and the normalizing constant K(θ) is calculated to be
\[ K(\theta) = \ln(\cosh(\theta)). \]
We wish to test
\[ H_0: \theta = 0, \]
and in that case K(0) = 0. The kernel in (10.2) appears in the Mann-Whitney statistic
when testing for a change in mean between two distributions. The use of different kernel
functions allows us flexibility in measuring the change between two segments.
"
τ "
N
L(θ; X τ , Y τ ) = fT (tij ; θ)
i=1 j=τ +1
210
10. Multiple Change-Point Problems
using the density in (10.1). Hence the composite log-likelihood, apart from a constant,
is given by
τ
N
(θ; X τ , Y τ ) = θ sgn(zi − zj ) − τ (N − τ )K(θ). (10.3)
i=1 j=τ +1
1
τ
N
θ̂(τ ) = tanh−1 sgn(zi − zj ) . (10.4)
τ (N − τ ) i=1 j=τ +1
τ̂ = argmax (θ̂(τ ); X τ , Y τ ).
τ
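A sketch (ours) of the single change-point fit based on (10.3)-(10.4): for each candidate
τ, compute ŵ and the profile log-likelihood, and take the argmax. The O(N²)
implementation is for illustration only.

```python
import numpy as np

def fit_single_changepoint(z):
    N = len(z)
    sgn = np.sign(z[:, None] - z[None, :])        # sgn(z_i - z_j)
    best_tau, best_ll = None, -np.inf
    for tau in range(1, N):
        s = sgn[:tau, tau:].sum()
        w = s / (tau * (N - tau))
        w = np.clip(w, -1 + 1e-10, 1 - 1e-10)     # keep arctanh finite
        theta = np.arctanh(w)                     # theta-hat(tau), cf. (10.4)
        ll = theta * s - tau * (N - tau) * np.log(np.cosh(theta))
        if ll > best_ll:
            best_tau, best_ll = tau, ll
    return best_tau, best_ll

rng = np.random.default_rng(4)
z = np.concatenate([rng.normal(0, 1, 60), rng.normal(1.5, 1, 40)])
print(fit_single_changepoint(z))                  # tau-hat near 60
```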
In the following lemma, we prove the almost sure convergence of this statistic.
Lemma 10.1. Let U_{m,n} = Σ_{i=1}^m Σ_{j=1}^n S_{ij}, where S_{ij} = sgn(X_i − Y_j), and let
μ_{XY} = E(sgn(X − Y)). Then, as min{m, n} → ∞,
\[ \frac{U_{m,n}}{mn} \xrightarrow{a.s.} \mu_{XY}. \]
Proof. Using the result in Lehmann (1975, p. 335),
\[ Var(U_{m,n}) = \sum_{i,j} Var(S_{ij}) + \sum \mathrm{Cov}(S_{ij}, S_{kl})
= mn\, Var(S_{ij}) + mn(m-1)\,\mathrm{Cov}(S_{ij}, S_{kj}) + mn(n-1)\,\mathrm{Cov}(S_{ij}, S_{il})
\le mn(m + n - 1)\, M, \]
where M is an upper bound for Var(S_{ij}). It follows from Chebyshev's inequality that
for each ε > 0 we have, as min{m, n} → ∞,
\[ P(|U_{m,n} - E(U_{m,n})| > \varepsilon mn) \le \frac{1}{\varepsilon^2 m^2 n^2}\, Var(U_{m,n}) = O\!\left(\frac{2M}{\varepsilon^2 \min\{m, n\}}\right) \to 0, \]
and hence U_{m,n} − E(U_{m,n}) → 0 in probability. For the subsequences {m²}, {n²},
\[ \sum_m \sum_n P\!\left(|U_{m^2,n^2} - E(U_{m^2,n^2})| > \varepsilon m^2 n^2\right)
\le \sum_m \sum_n O\!\left(\frac{2M}{\varepsilon^2\, mn\, \min\{m, n\}^2}\right) < \infty. \qquad (10.5) \]
Set
\[ D_{m,n} = \max \left|U_{k_1,k_2} - E(U_{k_1,k_2}) - \big(U_{m^2,n^2} - E(U_{m^2,n^2})\big)\right|, \]
where the maximum is taken over m² ≤ k_1 < (m+1)², n² ≤ k_2 < (n+1)². From
Chebyshev's inequality it also follows similarly that
\[ P\!\left(D_{m,n} > \varepsilon m^2 n^2 \text{ i.o.}\right) \le \frac{1}{\varepsilon^2 m^4 n^4}\, Var(D_{m,n}) \le O\!\left(\frac{64M}{\varepsilon^2 (\min\{m, n\})^2}\right), \]
and hence
\[ \frac{D_{m,n}}{m^2 n^2} \xrightarrow{a.s.} 0. \qquad (10.6) \]
Note that for m² ≤ k_1 < (m+1)² and n² ≤ k_2 < (n+1)², the deviation of U_{k_1,k_2} from
U_{m^2,n^2} is controlled by D_{m,n}, and the proof of almost sure convergence then follows by
using (10.5) and (10.6).
The purpose of the scaling in (10.7) is to give consistent estimates of the change-point
locations (see Section 10.3). In searching for K change-points, this binary segmentation
procedure has a computational cost of order O(KN ln N), as compared to a standard
grid search procedure with a computational cost of O(K·2^N). We can further speed up
the procedure by searching for the change-points in the segments in parallel.
To assess the significance of a candidate change-point τ̂(i*) within a segment, we use the
score statistic
\[ \frac{U(X_{\hat\tau(i^*)}, Y_{\hat\tau(i^*)})^2}{I(0)}, \]
where U(X_{\hat\tau(i^*)}, Y_{\hat\tau(i^*)}) and I(0) are the score function and Fisher information,
respectively, evaluated at θ = 0:
\[ U(X_{\hat\tau(i^*)}, Y_{\hat\tau(i^*)}) = \left.\frac{\partial \ell(\theta(\hat\tau(i^*)); X_{\hat\tau(i^*)}, Y_{\hat\tau(i^*)})}{\partial \theta}\right|_{\theta=0}
= \sum_{i=\hat\tau_{i^*-1}+1}^{\hat\tau(i^*)} \; \sum_{j=\hat\tau(i^*)+1}^{\hat\tau_{i^*+1}} \mathrm{sgn}(z_i - z_j), \]
\[ I(0) = -E\left[\left.\frac{\partial^2 \ell(\theta(\hat\tau(i^*)); X_{\hat\tau(i^*)}, Y_{\hat\tau(i^*)})}{\partial \theta^2}\right|_{\theta=0}\right]
= m_{i^*}\, n_{i^*}\, K''(0) = m_{i^*}\, n_{i^*}, \]
where m_{i*} and n_{i*} are the sizes of the two sub-segments created by τ̂(i*).
We reject H_0 for large values of the test statistic. For a fixed change-point location, the
test statistic has an asymptotic chi-square distribution with 1 d.f. under H_0. However,
the maximum of the test statistics among all the possible change-point locations does
not follow a chi-square distribution. We may instead compute the p-value of the test
statistic through a permutation test. Since under H_0 the X's and Y's are independent
and identically distributed, we can permute the observations in the segment and calculate
the test statistic for each permutation sample. Usually we can calculate the exact p-value
For a single change-point, the statistic may also be written as
\[ Q(X_\tau, Y_\tau) = \frac{\tau(N - \tau)}{N}\left[\hat{w}\, \tanh^{-1}(\hat{w}) + \frac{1}{2}\ln(1 - \hat{w}^2)\right],
\qquad \hat{w} = \frac{1}{\tau(N - \tau)} \sum_{i=1}^\tau \sum_{j=\tau+1}^N \mathrm{sgn}(z_i - z_j). \]
The following lemma shows that τ̂_N is a strongly consistent estimator of a single change-point
location. Here we assume that μ_{XY} = E(sgn(X − Y)) < 1; otherwise, it is trivial
that X is always greater than Y.
Lemma 10.2. Suppose there is a single change-point. Let γ be the true proportion of
observations belonging to the segment defined under model (10.1). Let {δ_N} be a sequence
of positive numbers such that δ_N → 0 and Nδ_N → ∞ as N → ∞. Then, for N large enough,
γ ∈ [δ_N, 1 − δ_N], and for all ε > 0,
\[ P\!\left(\lim_{N \to \infty} \left|\frac{\hat\tau_N}{N} - \gamma\right| < \varepsilon\right) = 1. \]
Proof. For any γ̃ ∈ [δ_N, 1 − δ_N], let X(γ̃) = {Z_1, …, Z_{⌊γ̃N⌋}} and Y(γ̃) = {Z_{⌊γ̃N⌋+1}, …,
Z_N}. Then, as N → ∞,
\[ \hat{w} \xrightarrow{a.s.} \left[I(\tilde\gamma \ge \gamma)\frac{\gamma}{\tilde\gamma} + I(\tilde\gamma < \gamma)\frac{1-\gamma}{1-\tilde\gamma}\right] \mu_{XY} = w(\tilde\gamma), \]
and
\[ \frac{1}{N}\, Q(X(\tilde\gamma), Y(\tilde\gamma)) \xrightarrow{a.s.} \tilde\gamma(1 - \tilde\gamma)\, h(w(\tilde\gamma)), \]
uniformly in γ̃, where h(a) = a tanh^{-1}(a) + ½ ln(1 − a²). Notice that |w(γ̃)| < 1, and,
applying the Taylor series expansion of h at a = 0, there exists a large K such that, with
a = w(γ̃),
\[ h(a) = \frac{a^2}{2} + \frac{a^4}{12} + \dots + \frac{a^{2K}}{2K(2K-1)}, \]
\[ \tilde\gamma(1 - \tilde\gamma)\, h(w(\tilde\gamma)) = \sum_{k=1}^K \frac{\mu_{XY}^{2k}}{2k(2k-1)}\left[I(\tilde\gamma \ge \gamma)\frac{\gamma^{2k}(1-\tilde\gamma)}{\tilde\gamma^{2k-1}} + I(\tilde\gamma < \gamma)\frac{(1-\gamma)^{2k}\,\tilde\gamma}{(1-\tilde\gamma)^{2k-1}}\right]. \qquad (10.9) \]
Lemma 10.3. Consider any change-point γ̃ such that γ^{(1)} ≤ γ̃ ≤ γ^{(2)}. Then the sequence
of N observations is partitioned into two segments, X(γ̃) = {Z_1, …, Z_{⌊γ̃N⌋}} and
Y(γ̃) = {Z_{⌊γ̃N⌋+1}, …, Z_N}. Let
\[ \hat{w} = \frac{1}{\lfloor\tilde\gamma N\rfloor\,(N - \lfloor\tilde\gamma N\rfloor)} \sum_{i=1}^{\lfloor\tilde\gamma N\rfloor} \sum_{j=\lfloor\tilde\gamma N\rfloor + 1}^N \mathrm{sgn}(z_i - z_j). \]
As N → ∞,
\[ \sup_{\tilde\gamma \in [\gamma^{(1)}, \gamma^{(2)}]} |\hat{w} - p(\tilde\gamma)| \xrightarrow{a.s.} 0, \]
where p(γ̃) is given in (10.10), uniformly in γ̃. Then
\[ d\!\left(\frac{\hat\tau_N}{N}, A_N\right) \xrightarrow{a.s.} 0, \quad \text{as } N \to \infty. \]
Proof. Note that p(γ̃) in (10.10) can be rewritten in the form p(γ̃) = (aγ̃ + b)/(γ̃(1 − γ̃))
for some a and b, and |p(γ̃)| < 1. Applying the Taylor series expansion of h at a = 0,
there exists a large K such that
\[ q(\tilde\gamma) = \tilde\gamma(1 - \tilde\gamma)\, h(p(\tilde\gamma)) \]
can be approximated by
\[ q(\tilde\gamma) = \sum_{k=1}^K \frac{\mu_{XY}^{2k}}{2k(2k-1)}\, \frac{(a\tilde\gamma + b)^{2k}}{(\tilde\gamma(1 - \tilde\gamma))^{2k-1}}. \]
It is easy to show that q(γ̃) is continuously differentiable and strictly convex, as the sums
and products of convex functions are also convex. Hence, for any two points γ̃_1, γ̃_2 ∈
[γ^{(1)}, γ^{(2)}], strict convexity provides a separation constant c > 0, and the rest of the
proof proceeds analogously to the proof of Theorem 2 of Matteson and James (2014).
Finally, the extension of the consistency proof to multiple change-points is straightforward
by noting that in the case of multiple change-points, p(γ̃) in (10.10) is still of the
form (aγ̃ + b)/(γ̃(1 − γ̃)).
We consider the "blocks" model
\[ Z_i = \sum_{j=1}^{11} h_j\, J(N t_i - \tau_j) + \sigma \varepsilon_i, \qquad J(x) = \frac{1 + \mathrm{sgn}(x)}{2}, \]
with
\[ \{\tau_j/N\} = \{0.1, 0.13, 0.15, 0.23, 0.25, 0.40, 0.44, 0.65, 0.76, 0.78, 0.81\}, \]
\[ \{h_j\} = \{2.01, -2.51, 1.51, -2.01, 2.51, -2.11, 1.05, 2.16, -1.56, 2.56, -2.11\}, \]
where the t_i's are equally spaced in [0, 1] and {τ_j/N} marks the segment information.
The position of each change-point is τ_j, and the mean difference is controlled by h_j.
Various i.i.d. distributions for ε_i are considered in our simulation experiments: N(0, 1),
the standard normal distribution; t(2), Student's t distribution with two degrees of
freedom; χ²(3), the chi-squared distribution with three degrees of freedom; Cauchy(0, 1),
the Cauchy distribution with location 0 and scale 1; Pareto(α = 0.5) and Pareto(α = 1.5),
Pareto distributions with α = 0.5 and 1.5; and LN(0, 1), the log-normal distribution with
location 0 and scale 1. In our simulations, N is set to be 500 and 1000 with σ = 0.5.
Examples of the simulated model with various error distributions can be found in Figure 10.1.
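A sketch (ours) simulating the blocks model above with Gaussian errors; the
change-point locations τ_j/N and jump sizes h_j are those listed in the text.

```python
import numpy as np

def simulate_blocks(N=500, sigma=0.5, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    taus = np.array([.1, .13, .15, .23, .25, .40, .44, .65, .76, .78, .81])
    h = np.array([2.01, -2.51, 1.51, -2.01, 2.51, -2.11,
                  1.05, 2.16, -1.56, 2.56, -2.11])
    t = np.arange(1, N + 1) / N
    J = lambda x: (1 + np.sign(x)) / 2           # unit step J(x)
    signal = sum(hj * J(t - tj) for hj, tj in zip(h, taus))
    return signal + sigma * rng.standard_normal(N)

z = simulate_blocks()                             # one simulated series
```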
Note that ξ1 (Γ̂est , Γtrue ) measures the under-segmentation error and ξ2 (Γtrue , Γ̂est ) mea-
sures the over-segmentation error. We also calculate the sum of ξ1 and ξ2 as a measure
of the total change-point location error.
Figure 10.1.: Simulated data plotted against position for the blocks model with Normal(0, 1),
χ²(3), and other error distributions (N = 500).
Table 10.1.: Simulation results for a known number of change-points (11). For each error
distribution (Cauchy(0, 1), t(2), Pareto(α = 0.5), …) and N = 500, 1000, the table reports
the under-segmentation error ξ1, the over-segmentation error ξ2, and their total for four
competing methods; standard deviations are given in parentheses, and bold numbers
indicate the method with the smallest total error.
10.5. Applications
10.5.1. Array CGH Data
DNA copy numbers refer to the number of copies of genomic DNA in a human. The
usual number is two for normal cells for all the non-sex chromosomes. Variations are
indicative of disease such as cancer. Hence, there is a need to produce copy number
maps. The array data used consist of the log2 -ratio of normalized intensities of the
red/green channels indexed by marker locations on a chromosome where the red and
green channels measure the intensities of the cancer and normal samples respectively.
There have been a number of different techniques proposed for analyzing copy number
variation (CNV) data. It is known that CNVs account for an abundance of genetic
variation and may influence phenotypic differences.
The change-point detection method above was applied to the array CGH (Array
Comparative Genomic Hybridization) data with experimentally tested DNA copy number
alterations of Snijders et al. (2001). These data can be downloaded from http://www.
nature.com/ng/journal/v29/n3/suppinfo/ng754_S1.html. These array data sets consist
Figure 10.2.: Log2-ratio array for cell line T47D and the change-points found by our
method, e.divisive, and PLR(PELT). Vertical dotted lines are the change-points found by
these methods.
of single experiments on 15 human cell strains and 9 other cell lines. Each cell line is
divided by chromosomes. Each array contains measurements for approximately 2200
BACs (bacterial artificial chromosomes) spotted in triplicate. The variable used for
analysis is the normalized average of the log2 test-over-reference ratio.
We first apply our method, as well as the e.divisive and PLR(PELT) methods, to one
of the cell lines, labeled T47D, with sample size N = 2295. To determine the number of
change-points, our method uses the 5% significance level. The parametric likelihood
(PLR) approach with the PELT algorithm uses the BIC penalty in the R package
"changepoint". The e.divisive method uses α = 1, minimum segment length 2, and the
5% significance level. Figure 10.2 flags the change-points found by these three methods.
We can see that our method successfully locates most of the change-points, especially
the short but significantly distinct segments. Note that our algorithm will not flag any
single extreme data point as a segment, although our minimum segment length is set to
be 1. Compared with the e.divisive method, our method found 51 change-points while
Figure 10.3.: The chromosomes with identified alterations (including GM01535 chromosome
12, GM03134 chromosome 8, GM05296 chromosome 10, and GM07081 chromosome 15)
and the change-points found by our method and e.divisive. Our results are shown by red
dotted lines; the results of e.divisive by blue dashed lines. A blue-red dash-dot line means
the two methods find the same change-point.
Figure 10.4.: The real interest rate data and change-points found by our method
Table 10.3.: Four recession periods and the positions of the change-points found by our
method, e.divisive, and PLR(PELT). Numbers in bold refer to the method which
successfully finds the starting point or the end point of the recession period.
Recession period   Our method   e.divisive   PLR(PELT)
1980:1–1980:11     1980:11      1980:11      1980:10
1990:7–1991:3      1990:7       1990:8       1990:7
2001:3–2001:11     2001:3       2001:3       2001:3
2007:12–2009:6     2007:12      2007:10      2007:10
From the first subgraph (1980:1–1984:12) in Figure 10.5, our method and e.divisive
both successfully find one change-point at the end of the recession period, in 1980:11.
However, the PLR(PELT) method overestimates the number of change-points and
falsely detects many change-points at the beginning of the data (during the period
1980–1981). This may be caused by the great variation and the outliers at the beginning of
the data set. From the second subgraph (1989:1–1993:12) in Figure 10.5, our method
and PLR(PELT) successfully locate the change-point at the beginning of the recession
period, in 1990:7, while the e.divisive method locates the change-point a bit late, in 1990:8.
From the third subgraph (2000:1–2004:12) in Figure 10.5, all three methods successfully
locate the change-point at the beginning of the recession period, in 2001:3. From the
fourth subgraph (2007:1–2011:12) in Figure 10.5, only our method successfully locates
the change-point at the beginning of the recession period, in 2007:12, while the other two
methods find the change-point two months early due to the sudden fall of the interest
rate data. Finally, we conclude that our method successfully locates the change-points
over all four recession periods.
Figure 10.5.: The real interest rate data around the four recession periods (1980:1–1984:12,
1989:1–1993:12, 2000:1–2004:12, 2007:1–2011:12). Vertical lines are the change-points
found by our method, e.divisive, and PLR(PELT); red star points mark the periods when
U.S. business recessions actually happened, as defined by the US National Bureau of
Economic Research. Our results are shown by red dotted lines, those of e.divisive by blue
dashed lines, and those of PLR(PELT) by purple thin lines. When different types of lines
coincide, different methods have found the same change-point. Details of the positions
found are given in Table 10.3.
Chapter Notes
11. Bayesian Models for Ranking Data
Ranking data are often encountered in practice when judges (or individuals) are asked
to rank a set of t items, which may be political goals, candidates in an election, types of
food, etc. We see examples in voting and elections, market research, and food preference
just to name a few. By studying ranking data, we can understand the judges’ perception
and preferences on the ranked alternatives.
Let R = (R(1), …, R(t))′ be a ranking of t items, labeled 1, …, t. It will be more
convenient to standardize the rankings as
\[ y = \frac{R - \frac{t+1}{2}\,\mathbf{1}}{\sqrt{t(t^2 - 1)/12}}, \]
where y is the t × 1 vector with ‖y‖ ≡ √(y′y) = 1 and 1 is the t-vector of ones. We consider
the following ranking model:
\[ \pi(y \mid \kappa, \theta) = C(\kappa, \theta)\, \exp\{\kappa\, \theta' y\}, \]
where the parameter θ is a t × 1 vector with ‖θ‖ = 1, the parameter κ ≥ 0, and C(κ, θ)
is the normalizing constant. In the case of the distance-based models (Alvo and Yu,
2014), the parameter θ can be viewed as a modal ranking vector. In fact, if R and μ_0
represent an observed ranking and the modal ranking of t items, respectively, then the
probability of observing R under the Spearman distance-based model is proportional to
\[ \exp\left\{-\frac{\lambda}{2}\sum_{i=1}^t (R(i) - \mu_0(i))^2\right\}
= \exp\left\{-\lambda\left(\frac{t(t+1)(2t+1)}{6} - \mu_0' R\right)\right\} \propto \exp\{\kappa\, \theta' y\}, \]
where κ = λ t(t² − 1)/12, and y and θ are the standardized rankings of R and μ_0,
respectively. However, the μ_0 in the distance-based model is a discrete permutation vector
of the integers {1, 2, …, t}, whereas the θ in our model is a real-valued vector representing
a consensus view of the relative preference of the items by the individuals. Since both
‖θ‖ = 1 and ‖y‖ = 1, the term θ′y can be seen as cos φ, where φ is the angle between the
Figure 11.1.: Illustration of the angle between the consensus score vector θ = (0, 1, 0)′
and the standardized observation of (1, 2, 3) on the sphere when t = 3.
consensus score vector θ and the observation y. Figure 11.1 illustrates an example
of the angle between the consensus score vector θ = (0, 1, 0)′ and the standardized
observation of R = (1, 2, 3)′ on the sphere for t = 3. The probability of observing a
ranking is proportional to the cosine of the angle from the consensus score vector. The
parameter κ can be viewed as a concentration parameter: for small κ, the distribution
of rankings will appear close to uniform, whereas for larger values of κ, the distribution
of rankings will be more concentrated around the consensus score vector. We call this
new model the angle-based ranking model.
To compute the normalizing constant C(κ, θ), let \mathcal{P}_t be the set of all possible
permutations of the integers 1, …, t. Then
\[ (C(\kappa, \theta))^{-1} = \sum_{y \in \mathcal{P}_t} \exp\{\kappa\, \theta' y\}. \qquad (11.1) \]
y∈Ρt
Notice that the summation is over the t! elements in Ρt . When t is large, says greater
than 15, the exact calculation of the normalizing constant is prohibitive. Using the fact
that the set of t! permutations lie on a sphere in (t − 1)-space, our model resembles
the continuous von Mises-Fisher distribution, abbreviated as vM F (x|m, κ), which is
defined on a (p − 1) unit sphere with mean direction m and concentration parameter κ:
\[
vMF(\boldsymbol{x}\,|\,\boldsymbol{m}, \kappa) = V_p(\kappa)\exp\{\kappa\,\boldsymbol{m}^{\top}\boldsymbol{x}\},
\]
where
\[
V_p(\kappa) = \frac{\kappa^{p/2-1}}{(2\pi)^{p/2}\, I_{p/2-1}(\kappa)},
\]
and Ia (κ) is the modified Bessel function of the first kind with order a. Consequently,
we may approximate the sum in (11.1) by an integral over the sphere:
\[
C(\kappa, \boldsymbol{\theta}) \approx C_t(\kappa) = \frac{\kappa^{(t-3)/2}}{2^{(t-3)/2}\; t!\; I_{(t-3)/2}(\kappa)\; \Gamma\!\left(\frac{t-1}{2}\right)},
\]
where Γ(·) is the gamma function. Table 11.1 shows the error rate of the approximate log-normalizing constant as compared to the exact one computed by direct summation. Here, κ is chosen to be between 0.01 and 2, and t ranges from 3 to 11. Note that the exact calculation of the normalizing constant for t = 11 requires the summation of 11! ≈ 4.0 × 10⁷ permutations; the computer ran out of memory (16 GB) beyond t = 11. The approximation is very accurate even when t = 3, and the error drops rapidly as t increases. Note also that this approximation allows us to approximate the first and second derivatives of log C, which facilitates the computation in what follows.
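The following minimal sketch (ours, not the authors' code; it assumes NumPy and SciPy) checks the approximation C_t(κ) against the exact normalizing constant obtained by direct summation for small t:

```python
import itertools
import numpy as np
from scipy.special import gammaln, ive  # ive(nu, x) = exp(-x) * I_nu(x)

def standardize(rank):
    """Standardized ranking y with ||y|| = 1."""
    t = len(rank)
    r = np.asarray(rank, dtype=float)
    return (r - (t + 1) / 2) / np.sqrt(t * (t**2 - 1) / 12)

def log_C_exact(kappa, theta):
    """log C(kappa, theta) by direct summation over all t! rankings, as in (11.1)."""
    t = len(theta)
    logs = [kappa * theta @ standardize(p)
            for p in itertools.permutations(range(1, t + 1))]
    return -np.logaddexp.reduce(logs)

def log_C_approx(kappa, t):
    """log C_t(kappa): the spherical approximation displayed above."""
    nu = (t - 3) / 2
    log_bessel = np.log(ive(nu, kappa)) + kappa   # numerically stable log I_nu(kappa)
    return (nu * np.log(kappa / 2) - gammaln(t + 1)
            - log_bessel - gammaln((t - 1) / 2))

t, kappa = 6, 2.0
theta = standardize(np.arange(1, t + 1))          # a unit-norm consensus direction
err = abs(log_C_approx(kappa, t) / log_C_exact(kappa, theta) - 1) * 100
print(f"error rate: {err:.5f}%")
```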
Notice that κ may grow with t since θᵀy is a sum of t terms. It can be seen from the applications in Section 11.4 that in one of the clusters for the APA data (t = 5), κ is 7.44 (≈ 1.5t) (see Table 11.4). We thus compute the error rate for κ = t and κ = 2t, as shown in Figure 11.2. The approximation is still accurate, with an error rate of less than 0.5%, for κ = t, and is acceptable for large t when κ = 2t, since the error rate decreases in t.
Figure 11.2.: The error rate of the approximate log-normalizing constant as compared
to the exact one computed by direct summation for κ = t and κ = 2t.
Table 11.1.: The error rate of the approximate log-normalizing constant as compared to the exact one computed by direct summation.

κ \ t        3           4           5           6           7           8           9           10          11
0.01    <0.00001%   <0.00001%   <0.00001%   <0.00001%   <0.00001%   <0.00001%   <0.00001%   <0.00001%   <0.00001%
0.1     <0.00001%   <0.00001%   <0.00001%   <0.00001%   <0.00001%   <0.00001%   <0.00001%   <0.00001%   <0.00001%
0.5      0.00003%    0.00042%    0.00024%    0.00013%    0.00007%    0.00004%    0.00003%    0.00002%    0.00001%
0.8      0.00051%    0.00261%    0.00150%    0.00081%    0.00046%    0.00027%    0.00017%    0.00011%    0.00008%
1        0.00175%    0.00607%    0.00354%    0.00194%    0.00110%    0.00066%    0.00041%    0.00027%    0.00018%
2        0.05361%    0.06803%    0.04307%    0.02528%    0.01508%    0.00932%    0.00598%    0.00398%    0.00273%
Given a random sample of N rankings with standardized values y₁, . . . , y_N, the log-likelihood of (κ, θ) is
\[
l(\kappa, \boldsymbol{\theta}) = N \log C_t(\kappa) + \kappa\,\boldsymbol{\theta}^{\top}\sum_{i=1}^{N}\boldsymbol{y}_i. \qquad (11.2)
\]
Maximizing (11.2) subject to ‖θ‖ = 1 and κ ≥ 0, we find that the maximum likelihood estimator of θ is given by
\[
\hat{\boldsymbol{\theta}}_{MLE} = \frac{\sum_{i=1}^{N}\boldsymbol{y}_i}{\left\|\sum_{i=1}^{N}\boldsymbol{y}_i\right\|},
\]
and κ̂ is the solution of
\[
A_t(\kappa) \equiv \frac{-C_t'(\kappa)}{C_t(\kappa)} = \frac{I_{(t-1)/2}(\kappa)}{I_{(t-3)/2}(\kappa)} = \frac{\left\|\sum_{i=1}^{N}\boldsymbol{y}_i\right\|}{N} \equiv r. \qquad (11.3)
\]
Equation (11.3) can be solved by the Newton-Raphson iteration
\[
\kappa_{i+1} = \kappa_i - \frac{A_t(\kappa_i) - r}{1 - A_t(\kappa_i)^2 - \frac{t-2}{\kappa_i}\,A_t(\kappa_i)}, \qquad i = 0, 1, 2, \ldots.
\]
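As a sketch (ours; it uses SciPy's exponentially scaled Bessel function, whose scaling cancels in the ratio), the iteration can be implemented as:

```python
import numpy as np
from scipy.special import ive

def A_t(kappa, t):
    # A_t(kappa) = I_{(t-1)/2}(kappa) / I_{(t-3)/2}(kappa)
    return ive((t - 1) / 2, kappa) / ive((t - 3) / 2, kappa)

def solve_kappa(r, t, kappa0=1.0, tol=1e-10, max_iter=100):
    """Solve A_t(kappa) = r by the Newton iteration displayed above."""
    kappa = kappa0
    for _ in range(max_iter):
        A = A_t(kappa, t)
        step = (A - r) / (1 - A**2 - (t - 2) / kappa * A)
        kappa -= step
        if abs(step) < tol:
            break
    return kappa

# Given standardized rankings as rows of Y:
#   theta_hat = Y.sum(0) / np.linalg.norm(Y.sum(0))
#   kappa_hat = solve_kappa(np.linalg.norm(Y.sum(0)) / len(Y), t=Y.shape[1])
```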
In the Bayesian analysis that follows, the posterior depends on the data through
\[
\boldsymbol{m} = \left(\beta_0\boldsymbol{m}_0 + \sum_{i=1}^{N}\boldsymbol{y}_i\right)\beta^{-1}, \qquad \beta = \left\|\beta_0\boldsymbol{m}_0 + \sum_{i=1}^{N}\boldsymbol{y}_i\right\|.
\]
The posterior density can be factored as
\[
p(\kappa, \boldsymbol{\theta}\,|\,Y) = p(\boldsymbol{\theta}\,|\,\kappa, Y)\, p(\kappa\,|\,Y), \qquad (11.5)
\]
where p(θ|κ, Y) = vMF(θ|m, βκ) and
\[
p(\kappa\,|\,Y) \propto \frac{[C_t(\kappa)]^{N+\nu_0}}{V_t(\beta\kappa)}
= \frac{\kappa^{\frac{t-3}{2}(\nu_0+N)}\, I_{(t-2)/2}(\beta\kappa)}{I_{(t-3)/2}^{\,\nu_0+N}(\kappa)\, (\beta\kappa)^{(t-2)/2}}.
\]
The normalizing constant for p(κ|Y) is not available in closed form. Nunez-Antonio and Gutiérrez-Pena (2005) suggested using a sampling-importance-resampling (SIR) procedure with a proposal density chosen to be the gamma density with mean κ̂_MLE and variance equal to some prespecified number such as 50 or 100. However, in a simulation study, it was found that the choice of this variance is crucial to the performance of SIR: an improper choice of variance may lead to slow or unsuccessful convergence. Moreover, the MCMC method is computationally intensive. Furthermore, when the sample size N is large, βκ can be very large, which complicates the computation of the term I_{(t−2)/2}(βκ) in V_t(βκ); the calculation of the weights in the SIR method will then fail. We conclude that, in view of the difficulties of sampling directly from p(κ|Y), it may be preferable to approximate the posterior distribution with an alternative method known as variational inference (abbreviated VI from here on).
We assign the joint prior
\[
p(\kappa, \boldsymbol{\theta}) = p(\boldsymbol{\theta}\,|\,\kappa)\,p(\kappa) = vMF(\boldsymbol{\theta}\,|\,\boldsymbol{m}_0, \beta_0\kappa)\,\mathrm{Gamma}(\kappa\,|\,a_0, b_0),
\]
where Gamma(κ|a₀, b₀) is the Gamma density function with shape parameter a₀ and rate parameter b₀ (i.e., mean equal to a₀/b₀), and p(θ|κ) = vMF(θ|m₀, β₀κ). The choice of Gamma(κ|a₀, b₀) for p(κ) is motivated by the fact that for large values of κ, p(κ) in (11.4) tends to take the shape of a Gamma density. In fact, for large values of κ, \(I_{(t-3)/2}(\kappa) \approx e^{\kappa}/\sqrt{2\pi\kappa}\), and hence p(κ) becomes the Gamma density with shape (ν₀ − 1)(t − 2)/2 + 1 and rate ν₀ − β₀.
In variational inference, the posterior is approximated by a factored density q(θ|κ)q(κ) chosen to minimize the Kullback-Leibler divergence from the true posterior. This can be shown to be equivalent to maximizing the evidence lower bound (ELBO) (Blei et al., 2017). The optimization of the variational factors q(θ|κ) and q(κ) is thus performed by maximizing the evidence lower bound L(q) on the log-marginal likelihood, which in our model is given by
\[
L(q) = E_{q(\boldsymbol{\theta},\kappa)}\left[\ln \frac{p(Y\,|\,\kappa,\boldsymbol{\theta})\,p(\boldsymbol{\theta}\,|\,\kappa)\,p(\kappa)}{q(\boldsymbol{\theta}\,|\,\kappa)\,q(\kappa)}\right] \qquad (11.6)
\]
\[
= E_{q(\boldsymbol{\theta},\kappa)}[f(\boldsymbol{\theta},\kappa)] - E_{q(\boldsymbol{\theta},\kappa)}[\ln q(\boldsymbol{\theta}\,|\,\kappa)] - E_{q(\kappa)}[\ln q(\kappa)] + \mathrm{constant},
\]
where all the expectations are taken with respect to q(θ, κ) and
\[
f(\boldsymbol{\theta},\kappa) = \kappa\,\boldsymbol{\theta}^{\top}\sum_{i=1}^{N}\boldsymbol{y}_i + N\,\frac{t-3}{2}\ln\kappa - N\ln I_{(t-3)/2}(\kappa) + \kappa\beta_0\,\boldsymbol{m}_0^{\top}\boldsymbol{\theta} + \frac{t-2}{2}\ln\kappa - \ln I_{(t-2)/2}(\kappa\beta_0) + (a_0 - 1)\ln\kappa - b_0\kappa.
\]
For fixed κ, the optimal variational factor satisfies \(\ln q^{*}(\boldsymbol{\theta}\,|\,\kappa) = \kappa\beta_0\,\boldsymbol{m}_0^{\top}\boldsymbol{\theta} + \kappa\,\boldsymbol{\theta}^{\top}\sum_{i=1}^{N}\boldsymbol{y}_i + \mathrm{constant}\). We recognize q*(θ|κ) as a von Mises-Fisher distribution vMF(θ|m, κβ), where
\[
\beta = \left\|\beta_0\boldsymbol{m}_0 + \sum_{i=1}^{N}\boldsymbol{y}_i\right\| \qquad \text{and} \qquad \boldsymbol{m} = \left(\beta_0\boldsymbol{m}_0 + \sum_{i=1}^{N}\boldsymbol{y}_i\right)\beta^{-1}.
\]
Let g(κ) denote the remaining terms in f(θ, κ) − ln q(θ|κ) which involve only κ:
\[
g(\kappa) = \left(N\,\frac{t-3}{2} + a_0 - 1\right)\ln\kappa - b_0\kappa - N\ln I_{(t-3)/2}(\kappa) - \ln I_{(t-2)/2}(\kappa\beta_0) + \ln I_{(t-2)/2}(\kappa\beta).
\]
It is still difficult to maximize \(E_{q(\kappa)}[g(\kappa)] - E_{q(\kappa)}[\ln q(\kappa)]\) since it involves the evaluation of the expected modified Bessel function. Following a similar idea in Taghia et al. (2014), we first find a tight lower bound \(\underline{g}(\kappa)\) for g(κ) so that
\[
L(q) \ge \underline{L}(q) = E_{q(\kappa)}\left[\underline{g}(\kappa)\right] - E_{q(\kappa)}[\ln q(\kappa)] + \mathrm{constant}.
\]
From the properties of the modified Bessel function of the first kind, it is known that the function ln I_ν(x) is strictly concave with respect to x and strictly convex with respect to ln x for all ν > 0. We therefore have the following two inequalities:
\[
\ln I_\nu(x) \le \ln I_\nu(\bar{x}) + \frac{\partial}{\partial x}\ln I_\nu(\bar{x})\,(x - \bar{x}), \qquad (11.7)
\]
\[
\ln I_\nu(x) \ge \ln I_\nu(\bar{x}) + \frac{\partial}{\partial x}\ln I_\nu(\bar{x})\,\bar{x}\,(\ln x - \ln \bar{x}), \qquad (11.8)
\]
where ∂/∂x ln I_ν(x̄) denotes the first derivative of ln I_ν(x) evaluated at x = x̄. Applying inequality (11.7) to ln I_{(t−3)/2}(κ) and ln I_{(t−2)/2}(κβ₀), and inequality (11.8) to ln I_{(t−2)/2}(κβ), we have
\[
g(\kappa) \ge \underline{g}(\kappa) = \left(N\,\frac{t-3}{2} + a_0 - 1\right)\ln\kappa - b_0\kappa + \ln I_{(t-2)/2}(\beta\bar{\kappa}) + \frac{\partial}{\partial(\beta\kappa)}\ln I_{(t-2)/2}(\beta\bar{\kappa})\,\beta\bar{\kappa}\,(\ln\beta\kappa - \ln\beta\bar{\kappa}) - N\ln I_{(t-3)/2}(\bar{\kappa}) - N\,\frac{\partial}{\partial\kappa}\ln I_{(t-3)/2}(\bar{\kappa})\,(\kappa - \bar{\kappa}) - \ln I_{(t-2)/2}(\beta_0\bar{\kappa}) - \frac{\partial}{\partial(\beta_0\kappa)}\ln I_{(t-2)/2}(\beta_0\bar{\kappa})\,\beta_0\,(\kappa - \bar{\kappa}).
\]
Since equality holds when κ = κ̄, the lower bound of L(q) is tight. Rearranging the terms, we obtain the approximate optimal solution ln q*(κ) = (a − 1) ln κ − bκ + constant, where
\[
a = a_0 + N\,\frac{t-3}{2} + \beta\bar{\kappa}\,\frac{\partial}{\partial(\beta\kappa)}\ln I_{(t-2)/2}(\beta\bar{\kappa}), \qquad (11.9)
\]
\[
b = b_0 + N\,\frac{\partial}{\partial\kappa}\ln I_{(t-3)/2}(\bar{\kappa}) + \beta_0\,\frac{\partial}{\partial(\beta_0\kappa)}\ln I_{(t-2)/2}(\beta_0\bar{\kappa}). \qquad (11.10)
\]
We recognize q*(κ) to be a Gamma(κ|a, b) density with shape a and rate b. Finally, the expansion point κ̄ for the next iteration is taken as the posterior mode (or mean when the mode does not exist):
\[
\bar{\kappa} = \begin{cases} \dfrac{a-1}{b} & \text{if } a > 1, \\[4pt] \dfrac{a}{b} & \text{otherwise.} \end{cases} \qquad (11.11)
\]
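The updates (11.9)-(11.11) lend themselves to a simple fixed-point loop. The sketch below is our illustration (not the authors' code); it assumes β₀ > 0 and uses the identity d/dx ln I_ν(x) = I_{ν+1}(x)/I_ν(x) + ν/x:

```python
import numpy as np
from scipy.special import ive

def dlogI(nu, x):
    # d/dx ln I_nu(x) = I_{nu+1}(x)/I_nu(x) + nu/x; the scaling in ive cancels
    return ive(nu + 1, x) / ive(nu, x) + nu / x

def vi_gamma_factor(Y, m0, beta0, a0, b0, n_iter=50):
    """Variational factor q(kappa) = Gamma(a, b) via (11.9)-(11.11).
    Y is the N x t matrix of standardized rankings; beta0 > 0 is assumed."""
    N, t = Y.shape
    s = beta0 * m0 + Y.sum(axis=0)
    beta = np.linalg.norm(s)        # q(theta|kappa) = vMF(m, beta * kappa)
    m = s / beta
    kbar = 1.0                      # initial expansion point
    for _ in range(n_iter):
        a = a0 + N * (t - 3) / 2 + beta * kbar * dlogI((t - 2) / 2, beta * kbar)
        b = (b0 + N * dlogI((t - 3) / 2, kbar)
             + beta0 * dlogI((t - 2) / 2, beta0 * kbar))
        kbar = (a - 1) / b if a > 1 else a / b   # update (11.11)
    return a, b, m, beta
```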
Figure 11.3.: Comparison of the posterior distribution obtained by the Bayesian SIR method and the approximate posterior distribution obtained by the variational inference approach, for data sizes N = 20 (left) and N = 100 (right).
For incomplete rankings, posterior inference can be achieved by Gibbs sampling based on the following two full conditional distributions:
\[
p(\boldsymbol{R}^*_1, \ldots, \boldsymbol{R}^*_N \,|\, \boldsymbol{R}^I_1, \ldots, \boldsymbol{R}^I_N, \boldsymbol{\theta}, \kappa) = \prod_{i=1}^{N} p(\boldsymbol{R}^*_i \,|\, \boldsymbol{R}^I_i, \boldsymbol{\theta}, \kappa),
\]
\[
p(\boldsymbol{\theta}, \kappa \,|\, \boldsymbol{R}^*_1, \ldots, \boldsymbol{R}^*_N) \propto p(\boldsymbol{\theta}, \kappa) \prod_{i=1}^{N} p(\boldsymbol{R}^*_i \,|\, \boldsymbol{\theta}, \kappa).
\]
Samples from p(θ, κ|R∗₁, . . . , R∗_N) can be generated using the Bayesian SIR method or the Bayesian VI method discussed in the previous sections. To sample from p(R∗₁, . . . , R∗_N |R^I₁, . . . , R^I_N, θ, κ), we need to fill in the missing ranks for each observation, and for that we appeal to the concept of compatibility described in Alvo and Yu (2014), which considers, for an incomplete ranking, the class of complete order-preserving rankings. For example, suppose we observe one incomplete subset ranking R^I = (2, −, 3, 4, 1)ᵀ. The set of corresponding compatible rankings is
{(2, 5, 3, 4, 1)ᵀ, (2, 4, 3, 5, 1)ᵀ, (2, 3, 4, 5, 1)ᵀ, (3, 2, 4, 5, 1)ᵀ, (3, 1, 4, 5, 2)ᵀ}.
Generally speaking, let Ω(R^I_i) be the set of complete rankings compatible with R^I_i. For an incomplete subset ranking with k out of t items being ranked, there are a total of t!/k! complete rankings in its compatible set. Note that p(R∗_i|R^I_i, θ, κ) ∝ p(R∗_i|θ, κ) for R∗_i ∈ Ω(R^I_i). Direct sampling from this distribution is obviously tedious for large t. Instead, we use the Metropolis-Hastings algorithm to draw samples from this
distribution, with the proposed candidates generated uniformly from Ω(R^I_i); a small sketch is given below. The idea of introducing compatible rankings allows us to treat different kinds of incomplete rankings easily. It is easy to sample uniformly from the compatible rankings since we just need to fill in the missing ranks under the different situations. In the case of top-k rankings, the compatibility set is defined so that the unranked items receive rankings larger than k. Note that it is also possible to use a Monte Carlo EM approach to handle incomplete rankings in a maximum likelihood setting, where Gibbs sampling is used in the E-step (see Yu et al. (2005)).
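The sketch below is our illustration of this scheme, not the authors' implementation: random_compatible draws uniformly from Ω(R^I) by inserting the unranked items one at a time (which yields each of the t!/k! compatible rankings with equal probability), and mh_step performs one Metropolis-Hastings update, where the uniform proposal reduces the acceptance ratio to exp{κθᵀ(y′ − y)}:

```python
import numpy as np

def standardize(rank):
    t = len(rank)
    return (np.asarray(rank, float) - (t + 1) / 2) / np.sqrt(t * (t**2 - 1) / 12)

def random_compatible(r_inc, rng):
    """Uniform draw from the compatible set; unranked items are marked None."""
    t = len(r_inc)
    ranked = [i for i in range(t) if r_inc[i] is not None]
    unranked = [i for i in range(t) if r_inc[i] is None]
    order = sorted(ranked, key=lambda i: r_inc[i])    # items from best to worst
    for i in unranked:                                # insert each unranked item
        order.insert(rng.integers(0, len(order) + 1), i)
    ranking = np.empty(t, dtype=int)
    for rank, item in enumerate(order, start=1):
        ranking[item] = rank
    return ranking

def mh_step(r_cur, r_inc, theta, kappa, rng):
    """One Metropolis-Hastings step with a uniform proposal on the compatible set."""
    r_prop = random_compatible(r_inc, rng)
    log_ratio = kappa * theta @ (standardize(r_prop) - standardize(r_cur))
    return r_prop if np.log(rng.random()) < log_ratio else r_cur

rng = np.random.default_rng(0)
print(random_compatible([2, None, 3, 4, 1], rng))     # one of the five rankings above
```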
11.4. Applications
11.4.1. Sushi Data Sets
We investigate the two data sets of Kamishima (2003) for finding the difference in food
preference patterns between eastern and western Japan. Historically, western Japan has
been mainly affected by the culture of the Mikado emperor and nobles, while eastern
Japan has been the home of the Shogun and Samurai warriors. Therefore, the preference
patterns in food are different between these two regions (Kamishima, 2003).
The first data set consists of complete rankings of t = 10 different kinds of sushi
given by 5000 respondents according to their preference. The region of respondents
is also recorded (N = 3285 for eastern Japan, 1715 for western Japan). We apply
the MLE, Bayesian-SIR, and Bayesian-VI methods to both the eastern and western Japan data.
We chose non-informative priors for both Bayesian-SIR and Bayesian-VI. Specifically,
the prior parameter m0 is chosen uniformly whereas β0 , a0 , and b0 are chosen to be
small numbers close to zero. Since the sample size N is quite large compared to t, the
estimated models for all three methods are almost the same. Figure 11.4 compares the posterior means of θ between eastern Japan (blue bar) and western Japan (red bar) obtained by the Bayesian-VI method. Note that a more negative value of θi indicates that sushi i is more preferred. From Figure 11.4, we see that the main difference in sushi preference between eastern and western Japan occurs for salmon roe, squid, sea eel, shrimp, and tuna. People in eastern Japan have a greater preference for salmon roe and tuna than the western Japanese. On the other hand, the latter have a greater preference for squid, shrimp, and sea eel. Table 11.2 shows the posterior parameters obtained by Bayesian-VI. It can be seen that the eastern Japanese are slightly more cohesive than the western Japanese, since the posterior mean of κ is larger.
The second data set contains incomplete rankings given by 5000 respondents who
were asked to pick and rank some of the t = 100 different kinds of sushi according
to their preference, and most of them only selected and ranked the top 10 out of 100
sushi. Figure 11.5 compares the box-plots of the posterior means of θ between eastern
Japan (blue box) and western Japan (red box) obtained by Bayesian-VI. The posterior
Figure 11.4.: Posterior means of θ for the sushi complete ranking data (t = 10) in eastern
Japan (blue bar) and western Japan (red bar) obtained by Bayesian-VI.
Table 11.2.: Posterior parameters for the sushi complete ranking data (t = 10) in eastern
Japan and western Japan obtained by Bayesian-VI.
Posterior Parameter Eastern Japan Western Japan
β 1458.85 741.61
a 18509.84 9462.70
b 3801.57 2087.37
Posterior Mean of κ 4.87 4.53
distribution of θ is based on the Gibbs samples obtained after dropping the first 200 samples as the burn-in period. Since there are too many kinds of sushi, the graph does not show the name of each one. However, we can see that about one-third of the 100 kinds of sushi have fairly large posterior means of θi whose values are quite close to each other. This is mainly because these sushi are less commonly preferred by the Japanese, and the respondents rarely chose them in their lists. As these sushi are usually not ranked in the top 10, it is natural that the posterior distributions of their θi's tend to have a larger variance.
From Figure 11.5, we see that there exists a greater difference between eastern and
western Japan for small θi ’s. Figure 11.6 compares the box-plots of the top 10 smallest
posterior means of θ between eastern Japan (blue box) and western Japan (red box).
The main difference in sushi preference between eastern and western Japan appears to
be in sea eel, salmon roe, tuna, sea urchin, and sea bream. The eastern Japanese prefer
salmon roe, tuna, and sea urchin sushi more than the western Japanese, while the latter
like sea eel and sea bream more than the former. Generally speaking, tuna and sea
urchin are more oily food, while salmon roe and tuna are more seasonal food. So from
the analysis of both data sets, we can conclude that the eastern Japanese usually prefer
more oily and seasonal food than the western Japanese (Kamishima, 2003).
Figure 11.5.: Box-plots of the posterior means of θ for the sushi incomplete rankings
(t = 100) in eastern Japan (blue box-plots) and western Japan (red box-
plots) obtained by Bayesian-VI.
Figure 11.6.: Box-plots of the top 10 smallest posterior means of θ for the sushi incom-
plete rankings (t = 100) in eastern Japan (blue box-plots and blue circles
for outliers) and western Japan (red box-plots and red pluses for outliers)
obtained by Bayesian-VI.
Table 11.3.: Deviance information criterion (DIC) for the APA ranking data.
G 1 2 3 4 5
DIC 54827 53497 53281 53367 53375
Table 11.4 indicates the posterior parameters for the three-cluster solution, and Figure 11.7 exhibits the posterior means of θ for the three clusters obtained by Bayesian-VI. It is very interesting to see that Cluster 1 ranks the clinical psychologists D and E as its first and second choices and especially dislikes the research psychologist C. Cluster 2
Table 11.4.: Posterior parameters for the APA ranking data (t = 5) for three clusters obtained by Bayesian-VI.

Posterior Parameter      Cluster 1   Cluster 2   Cluster 3
m                            0.06       -0.44        0.26
                             0.02        0.19        0.14
                             0.78       -0.64       -0.75
                            -0.54        0.49        0.55
                            -0.33        0.39       -0.19
β                         1067.10     1062.34      414.74
d                         3231.09     1317.21     1189.72
a                         4756.33     9224.97     1821.73
b                         3330.45     1239.41     1197.80
Posterior mean of κ          1.43        7.44        1.52
Posterior mean of τ        56.31%      22.96%      20.73%
prefers the research psychologists A and C but dislikes the others. Cluster 3 prefers research psychologist C. From Table 11.4, Cluster 1 represents the majority (posterior mean of τ₁ = 56.31%). Cluster 2 is small but more cohesive, since the posterior mean of κ₂ is larger. Cluster 3 has a posterior mean of τ₃ = 20.73%, and κ₃ is 1.52. The preferences for the five candidates expressed by the voters in the three clusters are heterogeneous, and the mixture model enables us to draw further inference from the data.
Figure 11.7.: Comparison of the posterior mean of θ for the APA data (t = 5) for three
clusters obtained by Bayesian-VI.
Chapter Notes
We proposed a new class of general exponential ranking models called angle-based ranking models. The model assumes a consensus score vector θ that reflects the rank-order preference of the items. The probability of observing a ranking is proportional to the exponential of κ times the cosine of the angle between its standardized form and the consensus score vector. Unlike distance-based models, the proposed consensus score vector θ exhibits detailed information on item preferences, whereas a distance-based model only provides an equally spaced modal ranking. We applied the method to the sushi data and concluded that certain types of sushi are seldom eaten by the Japanese. Our consensus score vector θ, defined on a unit sphere, can easily be re-parameterized to incorporate additional arguments or covariates into the model. The judge-specific covariates could be age, gender, and income; the item-specific covariates could be prices, weights, and brands; and the judge-item-specific covariates could be personal experience with using each item or brand. Adding those covariates to the model could greatly improve its predictive power. We could also develop Bayesian inference methods to facilitate the computation. Further details are available in Xu et al. (2018).
12. Analysis of Censored Data
Censored data occur when the value of an observation is only partially known. For example, someone's exact wealth may be unknown while it is known to exceed one million dollars. In left censoring, an observation is only known to fall below a certain value, whereas in right censoring it is only known to exceed a certain value. Type I censoring occurs when the subjects of an experiment are right censored at a fixed time. Type II censoring occurs when the experiment stops after a certain number of subjects have failed; the remaining subjects are then right censored. Truncated data occur when observations can never lie outside a given range; for example, all data outside the unit interval are discarded. A good illustration of these ideas arises in insurance: left truncation occurs when policyholders are subject to a deductible, whereas right censoring occurs when policyholders are subject to an upper payment limit.
A major theme of this chapter is to demonstrate that the key to deriving fundamental
results with the embedding approach for censored data lies in an appropriate choice of
the parametric family. We first briefly introduce survival analysis and then provide an
overview of the developments of rank tests for censored data, highlighting the difficulties
caused by ranking incomplete data and describing important landmarks in overcoming
these difficulties. Next, we generalize the parametric embedding approach to give a
new derivation of what these landmark results have finally led to. More importantly,
coupled with the LAN and local minimaxity results, the approach introduced yields
asymptotically optimal tests for local alternatives in the embedded parametric family.
Since the actual alternatives are unknown, the problem of adaptive (data-dependent)
choice of the score function for rank tests has witnessed important developments. We
provide a brief review of this topic and its implications for the choice of the parametric family in parametric embedding.
The survival function of a lifetime T is defined as
\[
S(t) = P(T > t) = 1 - F(t),
\]
where F(t) is the distribution function of T, assumed continuous. The survival function is a nonincreasing function of time with the properties
\[
S(0) = 1, \qquad S(\infty) = 0.
\]
Definition 12.1. The hazard function is defined as the limit of the probability that an individual fails in a very short time interval, given survival up to time t:
\[
h(t) = \frac{f(t)}{S(t)} = -\frac{S'(t)}{S(t)} = -\frac{d}{dt}\log S(t). \qquad (12.1)
\]
The hazard function is also known as the instantaneous failure rate. One of the
characteristics of survival data is the existence of incomplete data, the most common types of which are left truncation and right censoring.
Definition 12.2. Left truncation occurs when subjects enter a study at a specific time
and are followed henceforth until the event occurs or the subject is censored. Right
censoring occurs when a subject leaves the study before the event occurs or the study
ends before the event has occurred.
Suppose that the distinct ordered failure times are t₁ < t₂ < · · · and that
\[
t \in [t_j, t_{j+1}), \qquad j = 1, 2, \ldots.
\]
Then the probability of surviving beyond time t is estimated by the product-limit (Kaplan-Meier) estimator
\[
\hat{S}(t) = \prod_{i=1}^{j}\left(1 - \frac{d_i}{n_i}\right), \qquad (12.2)
\]
where nᵢ is the number of subjects at risk just prior to tᵢ, dᵢ is the number of failures at tᵢ, and cᵢ is the number censored in [tᵢ, tᵢ₊₁), so that
\[
n_{i+1} = n_i - d_i - c_i, \qquad i = 0, 1, 2, \ldots.
\]
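A minimal sketch of the product-limit estimator (12.2) (our illustration; events[i] = 1 marks a failure and 0 a censored observation):

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier estimate: returns (t_j, S_hat(t_j)) at each failure time."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    surv, s = [], 1.0
    for t in np.unique(times[events == 1]):       # ordered failure times
        n = np.sum(times >= t)                    # at risk just before t
        d = np.sum((times == t) & (events == 1))  # failures at t
        s *= 1 - d / n
        surv.append((t, s))
    return surv

print(kaplan_meier([3, 5, 5, 8, 10], [1, 1, 0, 1, 0]))
# [(3.0, 0.8), (5.0, 0.6), (8.0, 0.3)]
```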
where Eg denotes expectation with respect to the probability measure under which the
n observations are i.i.d. with common density function g, assuming that g is positive
whenever f is. In particular, consider testing H0 : f = g versus the location alternative
f(x) = g(x − θ) for small positive values of θ. In this case, differentiating both sides of (12.3) with respect to θ and letting θ ↓ 0 yields
\[
\left.\frac{\partial}{\partial\theta}\, P\{R_1 = r_1, \ldots, R_m = r_m\}\right|_{\theta=0}
= -\sum_{i=1}^{m} E_g\!\left[\frac{g'(V^{(r_i)})}{g(V^{(r_i)})}\right] \Big/ \binom{n}{m}. \qquad (12.4)
\]
Hence, by an extension of the Neyman-Pearson lemma, the derivative of the power function at θ = 0 is maximized by a test that rejects H₀ when the right-hand side of (12.4) exceeds some threshold C, chosen so that the test has type I error α when θ = 0. This test is therefore locally most powerful for testing alternatives of the form f(x) = g(x − θ) with θ ↓ 0; examples include the Fisher-Yates test when g is standard normal and the Wilcoxon test when g(x) = eˣ/(1 + eˣ)² is the logistic density.
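As a quick check of the logistic case, note that with G(x) = eˣ/(1 + eˣ),
\[
-\frac{g'(x)}{g(x)} = \frac{e^x - 1}{e^x + 1} = 2G(x) - 1,
\]
so the summands in (12.4) become \(E_g[2G(V^{(r_i)}) - 1] = 2E[U_{(r_i)}] - 1 = 2r_i/(n+1) - 1\), a linear function of the ranks; rejecting for large values of the resulting sum is precisely the two-sample Wilcoxon test.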
where θ_ℓ = (θ_{ℓ1}, . . . , θ_{ℓk}) represents the parameter vector for sample ℓ (ℓ = 1, 2), and x_{1j}, x_{2j} are the data from sample 1 and sample 2, with respective sizes m and n − m, associated with the ranking (permutation) ν_j, j = 1, . . . , n!. Under the null hypothesis H₀ : θ₁ = θ₂, we can assume without loss of generality that the underlying V₁, . . . , V_n from the combined sample are i.i.d. uniform (by considering G(Vᵢ), where G is the common distribution function, assumed to be continuous, of the Vᵢ) and that all rankings of the Vᵢ are equally likely. Hence the above model represents an exponential family
constructed by exponential tilting of the baseline measure (i.e., corresponding to H0 )
on the rank-order data. This is in the same spirit as Neyman's smooth test of the null
hypothesis that the data are i.i.d. uniform against alternatives in the exponential fam-
ily. The parametric embedding makes these results directly applicable to the rank-order
statistics as was discussed in Chapter 7. In particular, this shows that the two-sample
Wilcoxon test of H0 is locally most powerful for testing the uniform distribution against the
truncated exponential distribution for which the xj are constrained to lie in the range
(0, 1) of the uniform distribution. Note that these exponential tilting alternatives differ
from the location alternatives in the preceding paragraph not only in their distributional
form (truncated exponential instead of logistic) but also in avoiding the strong assump-
tion of the preceding paragraph that the data have to be generated from the logistic
distribution even under the null hypothesis.
Hajek (1962) considered simple linear rank statistics of the form
\[
S_n = \sum_{i=1}^{n} (c_i - \bar{c})\,\varphi\!\left(\frac{R_i}{n+1}\right). \qquad (12.6)
\]
He derived the asymptotic normality of S_n under the null hypothesis and under contiguous alternatives, and showed the test to have asymptotically maximum power uniformly over these alternatives if φ = −(f′ ∘ F⁻¹)/(f ∘ F⁻¹), where F is the distribution function with derivative f. Note that this result is consistent with the choice of the score function given by (12.4) for locally most powerful tests. Hajek (1968) subsequently introduced the projection method to extend these results to local alternatives that need not be contiguous to the null.
The rank tests in the preceding paragraph deal with the regression setting, which
is related to the location alternatives in Section 12.1. If we focus on k-sample prob-
lems, then parametric embedding as in the second paragraph of that section can be
applied and the idea of local asymptotic normality (LAN), which was also introduced
by Le Cam in conjunction with contiguity, can be applied to derive the LAN prop-
erty of the embedded family. As pointed out in Chapter 7 of van der Vaart (2007), a
sequence of parametric models is LAN if asymptotically (as n → ∞) their likelihood
ratio processes behave like those for the normal mean model via a quadratic expan-
sion of the log-likelihood function. Hajek (1970, 1972) and Le Cam made use of the
LAN property to derive asymptotic optimality in parametric estimation and testing via
convolution theorems and local asymptotic minimax bounds; see van der Vaart (2007),
Chapter 8. The Hajek-Le Cam theory was originally introduced to resolve the long-
standing problem concerning the efficiency of the maximum likelihood estimator in a
parametric model. For the problem of estimating a location parameter or more general
regression parameters, there is a corresponding asymptotic minimax theory introduced
by Huber (1964, 1972, 1973, 1981) associated with robust estimators which consist of
three types: the maximum likelihood type M-estimators, the R-estimators which are
derived from the rank statistics, and the L-estimators which are linear combinations of
order statistics. See Chapters 5, 13, and 22 of van der Vaart (2007). Although “it is
customary to treat nonparametric statistical theory as a subject completely different
from parametric theory,” Stein developed the least favorable parametric subfamilies for
nonparametric testing and estimation as “one of the obvious connections between the
two subjects.” The implication of Stein’s idea on our parametric embedding theme is
the possibility of establishing full asymptotic efficiency of a nonparametric test by using
a “least favorable” parametric family of densities for parametric embedding (see Bickel
(1982)).
for two samples assumes (a) equally likely rankings that give rise to p₀ⱼ and i.i.d. uniform G(V₁), . . . , G(V_n) under the null hypothesis, and (b) exponential tilting via distinct values of xⱼ or x_{ij} that are functions of the ranks. Here, θ_ℓ = (θ_{ℓ1}, . . . , θ_{ℓk}) represents the parameter vector for sample ℓ, and x_{1j}, x_{2j} are the data for samples 1 and 2, with respective sizes m and n − m. Under the null hypothesis H₀ : θ₁ = θ₂, we can assume without loss of generality that the underlying Vᵢ are i.i.d. uniform. The Vᵢ are not completely observable when the data are censored, so the observations are (Ṽᵢ, δᵢ), where Ṽᵢ = min(Vᵢ, cᵢ) and δᵢ = I{Vᵢ ≤ cᵢ}. Since the rank assigned to Vᵢ for complete data
is the empirical distribution function evaluated at Vi, the analog for censored data is
Ĝ(Ṽi ), where Ĝ is the Kaplan-Meier estimator which is the nonparametric MLE of G
for censored data. Hence the model under the null hypothesis is that of i.i.d. uniform
random variables censored by G(ci ), providing a partial analog of (a). Since Ĝ puts
all its mass at the uncensored observations (with δ = 1), this causes some difficulty in
generalizing (b) because the sample also contains censored observations. We note that
at each uncensored observation Ṽi , the information in the ordered sample conveys not
only the value of Vi but also how many observations Ṽj in the sample are ≥ Ṽi . When
the Vi denote failure times in survival analysis, this means the size of the risk set, that is,
the number of subjects who are at risk at an observed failure time Vi . This resolves the
inherent difficulty of ordering the censored observations whose actual failure times are
unknown except for their exceedance over ci . To rank the data, we need to have a total
order of the sample space, but the subset consisting of censored observations cannot be
totally ordered because the underlying failure times are unknown. Using the observed
failure time and the risk set size at each uncensored observation gives a partial analog of
the ranking for complete data. To be at risk at an observed failure time Vi , the subject
cannot fail prior to Vi . The jump in Ĝ(Ṽi ) basically measures the conditional probability
of failing in an infinitesimal interval around Ṽi given that failure has not occurred prior
to Ṽi . This means that we should think of hazard functions instead of density functions
and perform exponential tilting using hazard functions rather than density functions.
Consider the two-sample problem with censored data. Let V(1) < · · · < V(k) de-
note the ordered uncensored observations of the combined sample, Nj (resp. Mj ) de-
note the number of observations in the combined sample (resp. in sample 1) that are
≥ V(j) , and uj = 1 (resp. 0) if V(j) comes from sample 1 (resp. sample 2). Note
that {(1), . . . , (k), M₁, N₁, . . . , M_k, N_k} is invariant under the group of strictly increasing transformations of the data, which leaves the testing problem invariant. We now introduce embedding of the null
model into a smooth parametric family that also consists of alternatives. Instead of
tilting the density functions as before, we define the change of measures via intensity
(hazard) functions, as in Section II.7 of Andersen et al. (1993). Because the normalizing
constant e−K(θ) gets canceled in the numerator and denominator, it does not appear in
the likelihood ratio statistic. On the other hand, the denominator of (12.1) will induce
a function λ0 (t), which can be chosen as the baseline (or null hypothesis) hazard func-
tion, in the likelihood ratio. The analog for one sample therefore takes the proportional hazards form
\[
\pi(\boldsymbol{x}_j; \boldsymbol{\theta}, t) = h_0(t)\exp(\boldsymbol{\theta}^{\top}\boldsymbol{x}_j). \qquad (12.7)
\]
We discuss below the choice of xj that extends xj = X(νj ) to LTRC data, for which we
also define the hazard-induced rank statistics.
noting that comparisons can be made if the smaller of T̃ᵢ and T̃ⱼ is uncensored.¹ Breslow (1970) subsequently extended this to the k-sample case and expressed W in the counting-process form
\[
W = \int Y_2(s)\, dN_1(s) - \int Y_1(s)\, dN_2(s), \qquad (12.9)
\]
where \(N_1(s) = \sum_{i=1}^{m} I_{\{\tilde{T}_{1i} \le s,\, \delta_{1i}=1\}}\), \(N_2(s) = \sum_{j=1}^{n-m} I_{\{\tilde{T}_{2j} \le s,\, \delta_{2j}=1\}}\), and \(Y_1(s) = \sum_{i=1}^{m} I_{\{\tilde{T}_{1i} \ge s\}}\), \(Y_2(s) = \sum_{j=1}^{n-m} I_{\{\tilde{T}_{2j} \ge s\}}\) are the corresponding risk set sizes.
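In discrete form, (12.9) is simply a double sum over the observed failures; a small sketch (ours):

```python
def gehan_w(t1, d1, t2, d2):
    """Gehan's W in the counting-process form (12.9): integrate Y2 dN1 - Y1 dN2."""
    W = 0.0
    for t, d in zip(t1, d1):
        if d == 1:                           # failure in sample 1
            W += sum(s >= t for s in t2)     # Y2 at that failure time
    for t, d in zip(t2, d2):
        if d == 1:                           # failure in sample 2
            W -= sum(s >= t for s in t1)     # Y1 at that failure time
    return W
```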
Instead of the weight processes Y₁ and Y₂, which depend on both failures and censoring, Prentice (1978) suggested that a better alternative should depend on the survival experience in the combined sample. For complete data, the classical two-sample rank statistics have the form \(S_n = \sum_{i=1}^{m} a_n(R_i)\), where the scores a_n(j) are obtained from a score function φ on (0, 1] by a_n(j) = φ(j/n), so that \(S_n = \sum_{i=1}^{m} \varphi(G_n(T_i))\), where G_n is the empirical distribution function of the combined sample, or by some asymptotically equivalent variant such as the expected value of φ evaluated at the jth uniform order statistic from a sample of size n. As pointed out, the counterpart of G_n(Tᵢ) for censored data is Ĝ_n(T̃ᵢ), where Ĝ_n is the Kaplan-Meier estimate based on the combined sample. If δᵢ = 1, T̃ᵢ is the actual failure time and has score φ(Ĝ_n(T̃ᵢ)). On the other hand, if δᵢ = 0, then the
¹In fact, Gehan introduced a further refinement depending on whether the larger observation is censored or not.
actual failure time Tᵢ is unknown, other than that it exceeds T̃ᵢ, and therefore has score Φ(Ĝ_n(T̃ᵢ)), where
\[
\Phi(t) = \int_{t}^{1} \varphi(v)\, dv \Big/ (1 - t), \qquad 0 \le t < 1, \qquad (12.10)
\]
represents the average of the scores φ(u) with u ≥ t. This leads to the following extension of the classical rank statistic \(\sum_{i=1}^{m} \varphi(G_n(T_i))\) to censored data:
\[
S_n^{*} = \sum_{i=1}^{m} \left\{\delta_i\, \varphi(\hat{G}_n(\tilde{T}_i)) + (1 - \delta_i)\, \Phi(\hat{G}_n(\tilde{T}_i))\right\}. \qquad (12.11)
\]
where V_{(1)} < · · · < V_{(k)} are the ordered uncensored observations of the combined sample and \(\{\tilde{V}_{(i1)}, \ldots, \tilde{V}_{(i\nu_i)}\}\) is the unordered set of censored observations between V_{(i)} and V_{(i+1)}, setting V_{(0)} = 0. Cuzick (1985) proved this conjecture under some smoothness assumptions on φ and also extended the proof to show, in his Section 3, the asymptotic equivalence of (12.11) and
\[
S_n = \sum_{j=1}^{k} \psi\!\left(\hat{G}_n(V_{(j)})\right)\left(c_j - \frac{M_j}{N_j}\right), \qquad \text{where } \psi = \varphi - \Phi. \qquad (12.13)
\]
This form of rank statistic for censored data dates back to Mantel (1966) with ψ = 1. As shown by Gu et al. (1991), there is a one-to-one correspondence between φ and ψ:
\[
\varphi(t) = \psi(t) - \int_{0}^{t} \frac{\psi(s)}{1 - s}\, ds, \qquad 0 < t < 1,
\]
and rank statistics of the form (12.13) can be expressed in the form of generalized Mann-Whitney statistics \(W = \sum_{i=1}^{m}\sum_{j=1}^{n-m} w(\tilde{T}_{1i}, \delta_{1i}; \tilde{T}_{2j}, \delta_{2j})\) with
\[
w(\tilde{T}_{1i}, \delta_{1i}; \tilde{T}_{2j}, \delta_{2j}) =
\begin{cases}
-n\,\psi(\hat{G}_n(\tilde{T}_{1i}))/Y_{\bullet}(\tilde{T}_{1i}) & \text{if } \tilde{T}_{1i} \le \tilde{T}_{2j} \text{ and } \delta_{1i} = 1,\\
\;\;\,n\,\psi(\hat{G}_n(\tilde{T}_{2j}))/Y_{\bullet}(\tilde{T}_{2j}) & \text{if } \tilde{T}_{1i} \ge \tilde{T}_{2j} \text{ and } \delta_{2j} = 1,\\
\;\;\,0 & \text{otherwise,}
\end{cases} \qquad (12.14)
\]
where \(Y_{\bullet}(s) = \sum_{i=1}^{m} I_{\{\tilde{T}_{1i} \ge s\}} + \sum_{j=1}^{n-m} I_{\{\tilde{T}_{2j} \ge s\}}\) is the risk set size of the combined sample at s.
The representation (12.13) is convenient for extensions from the two-sample problem to the regression setting, in which the cⱼ are covariates in the regression model Vᵢ = Δcᵢ + εᵢ, as in (12.6). The Mⱼ/Nⱼ in (12.13) is now generalized to
\[
\bar{c}_j = \left(\sum_{i=1}^{n} c_i\, I_{\{\tilde{V}_i \ge \tilde{V}_{(j)}\}}\right) \Big/ N_j, \qquad (12.15)
\]
which is the average value of the covariate associated with the risk set at the uncensored
observation Ṽ(j) . Lai and Ying (1991, Theorem 1) established the asymptotic normality
of these rank statistics under the null hypothesis H0 : Δ = 0 and under local alternatives.
Analogous to the complete-data case, these tests are asymptotically efficient when ψ = (λ′ ∘ F⁻¹)/(λ ∘ F⁻¹), where F is the common distribution function and λ the hazard
function of the εi . They proved this result when the data can also be subject to left
truncation.
Suppose (cᵢ, Vᵢ, δᵢ) can be observed only when Ṽᵢ = min(Vᵢ, ξᵢ) ≥ τᵢ, where (τᵢ, ξᵢ, cᵢ)
are independent random vectors that are independent of the εi . The τi are left truncation
variables and Vi is also subject to right censoring by ξi . The case ξi ≡ ∞ corresponds to
the left-truncated model, for which multiplication of Vi and τi by −1 converts it into a
right-truncated model. Motivated by a controversy in cosmology involving Hubble’s Law
and Chronometric Theory, Bhattacharya et al. (1983) introduced a Mann-Whitney-type statistic \(W_n(\Delta) = \sum_{i \ne j} w_{ij}(\Delta)\) in the regression model Vᵢ = Δcᵢ + εᵢ, in which cᵢ represents log velocity and Vᵢ the negative log of luminosity; moreover, (cᵢ, Vᵢ) can only be observed if Vᵢ ≤ v₀. This is a right-truncated model with truncation variables τᵢ ≡ v₀, and letting (Vᵢ*, cᵢ*), i = 1, . . . , n, denote the observations, they defined eᵢ(Δ) = Vᵢ* − Δcᵢ* and
\[
w_{ij}(\Delta) =
\begin{cases}
c_i^* - c_j^* & \text{if } e_j(\Delta) < e_i(\Delta) \le v_0 - \Delta c_j^*,\\
c_j^* - c_i^* & \text{if } e_i(\Delta) < e_j(\Delta) \le v_0 - \Delta c_i^*,\\
0 & \text{otherwise,}
\end{cases} \qquad (12.16)
\]
since it is impossible to compare ei (Δ) and ej (Δ) if
Note the similarity of this idea to that proposed by Gehan for censored data; it has the same drawbacks as before. In fact, as shown by Lai and Ying (1991), what we discussed in the preceding paragraph for censored data can be readily extended to LTRC data (uᵢ*, Ṽᵢ*, δᵢ*), i = 1, . . . , n, generated from the larger sample consisting of \(\{(V_i, c_i),\, i = 1, \ldots, m(n)\}\), where \(m(n) = \inf\{m : \sum_{i=1}^{m} I_{\{\tau_i \le \min(V_i, c_i)\}} = n\}\), with (Ṽᵢ, δᵢ) observable only when Ṽᵢ ≥ τᵢ. The risk set size at t in this case is \(Y(t) = \sum_{i=1}^{m(n)} I_{\{\tau_i - \Delta c_i \le t \le \tilde{V}_i - \Delta c_i\}}\) and
where \(I_j = \{i : \tilde{V}_i \ge \tilde{V}_{(j)}\}\) is the risk set at the ordered uncensored observation Ṽ₍ⱼ₎; this is the same as the expression given by Cox (1972) using conditional arguments and later by Cox (1975) using partial likelihood. It can be readily extended to LTRC data by redefining the risk set at Ṽ₍ⱼ₎ as \(\{i : \tilde{V}_i \ge \tilde{V}_{(j)} \ge \tau_i\}\). Basically, the regression model in the preceding paragraph considers the residuals Ṽᵢ − Δcᵢ, whereas for hazard regression we consider the Ṽᵢ instead.
The asymptotic efficiency of the rank tests depends on the class of alternatives in the
embedded parametric family, which may not contain the actual alternative.
The problem of finding the parametric family that gives the best asymptotic minimax bound has been an active area of research since the seminal paper of Stein, which describes a basic idea inherently related to the theme of this chapter.
The implication of Stein's idea for our parametric embedding theme is the possibility of establishing full asymptotic efficiency of a nonparametric/semi-parametric test by using a "least favorable" parametric family of densities for parametric embedding. Lai and Ying (1992, Section 2) have shown how this can be done for regression models with i.i.d. additive noise εᵢ. The least favorable parametric family has hazard functions of the form λ(t) + θη(t), where η is an approximation to −λ′Γ₁/Γ₀, λ is the hazard function of the εᵢ, and it is assumed that for h = 0, 1, 2,
\[
\Gamma_h(s) = \lim_{m\to\infty} m^{-1} \sum_{i=1}^{m} E\left\{c_i^h\, I_{\{\tau_i - \Delta c_i \le s \le c_i - \Delta c_i\}} \big/ \left(1 - F(\tau_i - \Delta c_i)\right)\right\}
\]
exists for every s with F (s) < 1, where F is the distribution function of εi . In particular,
the technical details underlying the approximation are given in (2.26 a, b, c) of that pa-
per. Lai and Ying (1991, 1992) have also shown how these semi-parametric information
bounds can be attained by using a score function that incorporates adaptive estimation
of λ. For a comprehensive overview of semi-parametric efficiency and adaptive estima-
tion in other contexts, see Bickel et al. (1993). For further details on the analysis of
censored data, see Alvo et al. (2018).
Appendices
A. Description of Data Sets

Age Group
16–19    20–34    35–54    55–69    >70
 8.62     9.85     9.98     9.12    4.80
 9.94    10.43    10.69     9.89    9.18
10.06    11.31    11.40    10.57    9.27
Historically, western Japan was mainly influenced by the culture of the Mikado emperor and nobles, while eastern Japan has been the home of the Shogun and Samurai warriors. Therefore, the preference patterns in food differ between these two regions (Kamishima, 2003).
The first data set consists of complete rankings with t = 10: 5000 respondents were asked to rank 10 different kinds of sushi according to their preference. The region of each respondent is also recorded (N = 3285 for eastern Japan, 1715 for western Japan).
Bibliography
Aalen, O. (1978). Nonparametric estimation of partial transition probabilities in multiple
decrement models. Ann. Statist., 6(3):534–545.
Alvo, M. (2016). Bridging the gap: a likelihood function approach for the analysis of ranking data. Communications in Statistics - Theory and Methods, Series A, 45:5835–5847.
Alvo, M. and Berthelot, M.-P. (2012). Nonparametric tests of trend for proportions.
International Journal of Statistics and Probability, 1:92–104.
Alvo, M. and Cabilio, P. (1991). On the balanced incomplete block design for rankings.
The Annals of Statistics, 19:1597–1613.
Alvo, M. and Cabilio, P. (1994). Rank test of trend when data are incomplete. Envi-
ronmetrics, 5:21–27.
Alvo, M. and Cabilio, P. (1995). Testing ordered alternatives in the presence of incom-
plete data. Journal of the American Statistical Association, 90:1015–1024.
Alvo, M. and Cabilio, P. (1999). A general rank based approach to the analysis of block
data. Communications in Statistics: Theory and Methods, 28:197–215.
Alvo, M. and Cabilio, P. (2005). General scores statistics on ranks in the analysis of
unbalanced designs. The Canadian Journal of Statistics, 33:115–129.
Alvo, M., Cabilio, P., and Feigin, P. (1982). Asymptotic theory for measures of concor-
dance with special reference to Kendall’s tau. The Annals of Statistics, 10:1269–1276.
Alvo, M., Lai, T. L., and Yu, P. L. H. (2018). Parametric embedding of nonparametric
inference problems. Journal of Statistical Theory and Practice, 12(1):151–164.
Alvo, M. and Pan, J. (1997). A general theory of hypothesis testing based on rankings.
Journal of Statistical Planning and Inference, 61:219–248.
Alvo, M. and Xu, H. (2017). The analysis of ranking data using score functions and
penalized likelihood. Austrian Journal of Statistics, 46:15–32.
Alvo, M. and Yu, P. L. H. (2014). Statistical Methods for Ranking Data. Springer.
Alvo, M., Yu, P. L. H., and Xu, H. (2017). A semi-parametric approach to the multiple
change-point problem. Working paper, The University of Hong Kong.
Andersen, P., Borgan, O., Gill, R., and Keiding, N. (1993). Statistical Models Based on
Counting Processes. Springer: New York.
Asmussen, S., Jensen, J., and Rojas-Nandayapa, L. (2016). Exponential family tech-
niques for the lognormal left tail. Scandinavian Journal of Statistics, 43:774–787.
Bai, J. and Perron, P. (2003). Computation and analysis of multiple structural change
models. Journal of Applied Econometrics, 18(1):1–22.
Banerjee, A., Dhillon, I. S., Ghosh, J., and Sra, S. (2005). Clustering on the unit hyper-
sphere using von Mises-Fisher distributions. Journal of Machine Learning Research,
6(Sep):1345–1382.
Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1993). Efficient and
Adaptive Estimation for Semiparametric Models. The Johns Hopkins University Press.
Billingsley, P. (2012). Probability and Measure. John Wiley and Sons, anniversary
edition.
Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review
for statisticians. Journal of the American Statistical Association, 112(518):859–877.
Box, G. E. P. and Tiao, G. C. (1973). Bayesian Inference in Statistical Analysis.
Addison-Wesley Publishing Company.
Busse, L. M., Orbanz, P., and Buhmann, J. M. (2007). Cluster analysis of heterogeneous
rank data. In Proceedings of the 24th International Conference on Machine Learning,
pages 113–120.
Cabilio, P. and Peng, J. (2008). Multiple rank-based testing for ordered alternatives
with incomplete data. Statistics and Probability Letters, 78:2609–2613.
Casella, G. and George, E. I. (1992). Explaining the Gibbs sampler. The American
Statistician, 46:167–174.
Cox, D. R. (1972). Regression models and life-tables. J. Roy. Statist. Soc. Ser. B.,
34(2):187–220.
Critchlow, D. (1985). Metric Methods for Analyzing Partially Ranked Data. Springer-
Verlag: New York.
Cuzick, J. (1985). Asymptotic properties of censored linear rank tests. Ann. Statist.,
13(1):133–141.
Efron, B. (1981). Nonparametric standard errors and confidence intervals. The Canadian
Journal of Statistics, 9(2):139–158.
Feigin, P. D. and Alvo, M. (1986). Intergroup diversity and concordance for ranking
data: an approach via metrics for permutations. The Annals of Statistics, 14:691–707.
Feigin, P. D. and Cohen, A. (1978). On a model for concordance between judges. Journal
of the Royal Statistical Society Series B, 40:203–213.
Ferguson, T. (1996). A Course in Large Sample Theory. John Wiley and Sons.
Fraser, D. (1957). Non Parametric Methods in Statistics. John Wiley and Sons., New
York.
Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in
the analysis of variance. Journal of the American Statistical Association, 32:675–701.
Gao, X. and Alvo, M. (2005a). A nonparametric test for interaction in two-way layouts.
The Canadian Journal of Statistics, 33:1–15.
Gao, X. and Alvo, M. (2005b). A unified nonparametric approach for unbalanced fac-
torial designs. Journal of the American Statistical Association, 100:926–941.
Gao, X. and Alvo, M. (2008). Nonparametric multiple comparison procedures for unbal-
anced two-way layouts. Journal of Statistical Planning and Inference, 138:3674–3686.
Gao, X., Alvo, M., Chen, J., and Li, G. (2008). Nonparametric multiple comparison
procedures for unbalanced one-way factorial designs. Journal of Statistical Planning
and Inference, 138:2574–2591.
Garcia, R. and Perron, P. (1996). An analysis of the real interest rate under regime
shifts. The Review of Economics and Statistics, pages 111–125.
Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the
Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 6:721–741.
Gu, M. G., Lai, T. L., and Lan, K. K. G. (1991). Rank tests based on censored data
and their sequential analogues. Amer. J. Math. & Management Sci., 11(1–2):147–176.
Hajek, J. (1962). Asymptotically most powerful rank-order tests. Ann. Math. Statist.,
33(3):1124–1147.
Hajek, J. (1968). Asymptotic normality of simple linear rank statistics under alterna-
tives. Ann. Math. Statist., 39:325–346.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their
applications. Biometrika, 57:97–109.
Hájek, J. and Sidak, Z. (1967). Theory of Rank Tests. Academic Press, New York.
Huber, P. J. (1972). The 1972 Wald lecture robust statistics: A review. Annals of
Mathematical Statistics, 43(4):1041–1067.
Jin, W. R., Riley, R. M., Wolfinger, R. D., White, K. P., Passador-Gundel, G., and Gib-
son, G. (2001). The contribution of sex, genotype and age to transcriptional variance
in drosophila melanogaster. Nature Genetics, 29:389–395.
John, J. and Williams, E. (1995). Cyclic Designs. Chapman Hall, New York.
Kendall, M. and Stuart, A. (1979). The Advanced Theory of Statistics, volume 2. Griffin,
London, fourth edition.
Kidwell, P., Lebanon, G., and Cleveland, W. S. (2008). Visualizing incomplete and
partially ranked data. IEEE Transactions on Visualization and Computer Graphics,
14(6):1356–1363.
Killick, R., Fearnhead, P., and Eckley, I. (2012). Optimal detection of changepoints
with a linear computational cost. Journal of the American Statistical Association,
107(500):1590–1598.
Kruskal, W. H. (1952). A nonparametric test for the several sample problem. Annals of
Mathematical Statistics, 23:525–540.
Lai, T. L. and Ying, Z. (1991). Rank regression methods for left-truncated and right-
censored data. Ann. Statist., 19(2):531–556.
Liang, F., Liu, C., and Carroll, J. D. (2010). Advanced Markov Chain Monte Carlo
Methods. John Wiley & Sons.
Lindsay, B. G. and Qu, A. (2003). Inference functions and quadratic score tests. Statist.
Sci., 18(3):394–410.
Mantel, N. (1966). Evaluation of survival data and two new rank order statistics arising
in its consideration. Cancer Chemotherapy Reports, 50(3):163–170.
Marden, J. I. (1995). Analyzing and Modeling Rank Data. Chapman Hall, New York.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E.
(1953). Equations of state calculations by fast computing machines. Journal of Chem-
ical Physics, 21:1087–1092.
Mielke, P. W., Jr. and Berry, K. J. (2001). Permutation Methods: A Distance
Function Approach. Springer.
Neyman, J. and Pearson, E. (1933). On the problem of the most efficient tests of
statistical hypotheses. Philo. Trans. Roy. Soc. A, 231:289–337.
Page, E. (1963). Ordered hypotheses for multiple treatments: a significance test for
linear ranks. Journal of the American Statistical Association, 58:216–230.
Pearson, K. (1900). On the criterion that a given system of deviations from the probable
in the case of a correlated system of variables is such that it can be reasonably supposed
to have arisen from random sampling. Philosophical Magazine, pages 157–175.
Prentice, R. (1978). Linear rank tests with right censored data. Biometrika, 65:167–179.
Qian, Z. and Yu, P. L. H. (2018). Weighted distance-based models for ranking data
using the R package rankdist. Journal of Statistical Software, forthcoming.
Ralston, A. (1965). A First Course in Numerical Analysis. McGraw Hill, New York.
Rayner, J. C. W., Best, D. J., and Thas, O. (2009a). Generalised smooth tests of
goodness of fit. Journal of Statistical Theory and Practice, pages 665–679.
Rayner, J. C. W., Thas, O., and Best, D. J. (2009b). Smooth Tests of Goodness of Fit
Using R. John Wiley and Sons, 2nd edition.
Robert, C. and Casella, G. (2004). Monte Carlo Statistical Methods. Springer, New
York, 2nd edition.
Royston, I. (1982). Expected normal order statistics (exact and approximate). Journal
of the Royal Statistical Society Series C, 31(2):161–165. Algorithm AS 177.
Schach, S. (1979). An alternative to the Friedman test with certain optimality properties.
Ann. Statist., 7(3):537–550.
Siegel, S. and Tukey, J. W. (1960). A nonparametric sum of ranks procedure for relative
spread in unpaired samples. Journal of the American Statistical Association, 55:429–
445.
Siegmund, D. (1976). Importance sampling in the Monte Carlo study of sequential tests.
The Annals of Statistics, 4(4):673–684.
Snijders, A. M., Nowak, N., Segraves, R., Blackwood, S., Brown, N., Conroy, J., Hamil-
ton, G., Hindle, A. K., Huey, B., Kimura, K., et al. (2001). Assembly of microarrays
for genome-wide measurement of DNA copy number. Nature Genetics, 29(3):263–264.
Sra, S. (2012). A short note on parameter approximation for von Mises-Fisher distribu-
tions: and a fast implementation of I_s(x). Computational Statistics, 27(1):177–190.
Taghia, J., Ma, Z., and Leijon, A. (2014). Bayesian estimation of the von Mises-Fisher mixture model with variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(9):1701–1715.
Terry, M. (1952). Some rank order tests which are most powerful against specific para-
metric alternatives. Ann. Math. Statist., 23(3):346–366.
Tierney, L. (1994). Markov chains for exploring posterior distributions (with discussion
and rejoinder). Annals of Statistics, 22(4):1701–1762.
Varin, C., Reid, N., and Firth, D. (2011). An overview of composite likelihood methods.
Statistica Sinica, 21:5–42.
Xu, H., Alvo, M., and Yu, P. L. H. (2018). Angle-based models for ranking data.
Computational Statistics and Data Analysis, 121:113–136.
Yu, P. L. H., Lam, K. F., and Alvo, M. (2002). Nonparametric rank test for independence
in opinion surveys. Austrian Journal of Statistics, 31:279–290.
Yu, P. L. H., Lam, K. F., and Lo, S. M. (2005). Factor analysis for ranked data with
application to a job selection attitude survey. Journal of the Royal Statistical Society
Series A, 168(3):583–597.
Index
A
Absolutely continuous, 32
Acceptance/rejection sampling, 39
Analysis of censored data, 245
Array CGH Data, 220

B
Balanced incomplete block design, 157
Bayesian methods, 37
Bayesian Models, 229
Beta distribution, 38
Borel-Cantelli lemma, 11

C
Categorical data, 76
Cayley, 80, 84, 86
Central moment, 7
Change of measure distributions, 71
Change-point problem, 209
Chebyshev inequality, 12
Chebyshev polynomials, 67, 72
Combinatorial central limit theorem, 55
Compatibility, 152
Complete block designs, 175
Composite likelihood, 35
Concordance, 139
Conditional density, 6
Conjugate prior, 38, 233
Consistency, 18
Contiguity, 31
Contingency tables, 87
Convergence almost surely, 12
Convergence in distribution, 12
Convergence in probability, 12
Cramér-Rao inequality, 19
Cramér-Wold, 16
Cumulant, 70
Cumulative distribution function, 5
Cyclic designs, 158
Cyclic structure models, 86

D
Delta method, 14
Design problems, 152
Distance between sets of ranks, 118
Distance function, 80
Distance-based models, 79
Durbin test, 158

E
Efficiency, 187
Exponential family, 19
Exponential tilting, 19
Extreme value density, 173

F
Fisher information matrix, 20
Fisher-Yates normal scores, 172
Fisher-Yates test, 166, 167

G
Gamma density, 43

L
Laguerre polynomials, 67
Le Cam's first lemma, 32
Le Cam's second lemma, 33
Le Cam's third lemma, 33
Leisure time data, 259
Likelihood function, 17

P
Pearson goodness of fit statistic, 78
Penalized likelihood, 147
Permutation, 79
Permutation test, 101
φ-Component models, 84
Pitman efficiency, 187
279