
Springer Series in the Data Sciences

Mayer Alvo · Philip L. H. Yu

A Parametric Approach to Nonparametric Statistics
Springer Series in the Data Sciences

Series Editors:
Jianqing Fan, Princeton University, Princeton
Michael Jordan, University of California, Berkeley
Ravi Kannan, Microsoft Research Labs, Bangalore
Yurii Nesterov, Université Catholique de Louvain, Louvain-la-Neuve
Christopher Ré, Stanford University, Stanford
Larry Wasserman, Carnegie Mellon University, Pittsburgh

Springer Series in the Data Sciences focuses primarily on monographs and graduate level textbooks. The target
audience includes students and researchers working in and across the fields of mathematics, theoretical computer
science, and statistics.
Data Analysis and Interpretation is a broad field encompassing some of the fastest-growing subjects in interdisciplinary statistics, mathematics and computer science. It encompasses a process of inspecting, cleaning,
transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and
supporting decision making. Data analysis has multiple facets and approaches, including diverse techniques un-
der a variety of names, in different business, science, and social science domains. Springer Series in the Data
Sciences addresses the needs of a broad spectrum of scientists and students who are utilizing quantitative methods
in their daily research.
The series is broad but structured, including topics within all core areas of the data sciences. The breadth of
the series reflects the variation of scholarly projects currently underway in the field of machine learning.

More information about this series at http://www.springer.com/series/13852


Mayer Alvo • Philip L. H. Yu

A Parametric Approach to
Nonparametric Statistics

Mayer Alvo
Department of Mathematics and Statistics
University of Ottawa
Ottawa, ON, Canada

Philip L. H. Yu
Department of Statistics and Actuarial Science
University of Hong Kong
Hong Kong, China

ISSN 2365-5674 ISSN 2365-5682 (electronic)


Springer Series in the Data Sciences
ISBN 978-3-319-94152-3 ISBN 978-3-319-94153-0 (eBook)
https://doi.org/10.1007/978-3-319-94153-0

Library of Congress Control Number: 2018951779

© Springer Nature Switzerland AG 2018


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically
the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way,
and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence
of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate
at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material
contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims
in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
In randomized block designs, the Friedman statistic provides a nonparametric test of the
null hypothesis of no treatment effect. This book was motivated by the observation that
when the problem is embedded into a smooth alternative model to the uniform distribu-
tion over a set of rankings, this statistic emerges as a score statistic. The realization that
this nonparametric problem could be viewed within the context of a parametric problem
was particularly revealing and led to various consequences. Suddenly, it seemed that
one could exploit the tools of parametric statistics to deal with several nonparamet-
ric problems. Penalized likelihood methods were used in this context to focus on the
important parameters. Bootstrap methods were used to obtain approximations to the
distributions of estimators and to construct confidence intervals. Bayesian methods were
introduced to widen the scope of applicability of distance-based models. As well, the
more commonly used test statistics in nonparametric statistics were reexamined. The
occurrence of ties in the sign test could be dealt with in a natural formal manner as
opposed to the traditional ad hoc approach. This book is a first attempt at bridging the
gap between parametric and nonparametric statistics and we expect that in the future
more applications of this approach will be forthcoming.
The authors are grateful to Mr. Hang Xu for contributions that were incorporated
in Chapter 10. We are grateful to our families for their support throughout the writing
of this book. In particular, we thank our wives Helen and Bonnie for their patience and
understanding. We are also grateful for the financial support of the Natural Sciences
and Engineering Research Council of Canada (NSERC) and the Research Grants Coun-
cil of the Hong Kong Special Administrative Region, China (Project No. 17303515),
throughout the preparation of this book.
Ottawa, ON, Canada Mayer Alvo
Hong Kong, China Philip L. H. Yu

Contents

I. Introduction and Fundamentals 1


1. Introduction 3

2. Fundamental Concepts in Parametric Inference 5


2.1. Concepts in Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1. Random Variables and Probability Functions . . . . . . . . . . . . 5
2.1.2. Modes of Convergence and Central Limit Theorems . . . . . . . . 11
2.2. Multivariate Central Limit Theorem . . . . . . . . . . . . . . . . . . . . 16
2.3. Concepts in Statistical Inference . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1. Parametric Estimation . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.2. Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.3. Composite Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3.4. Bayesian Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.3.5. Variational Inference . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.4. Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3. Tools for Nonparametric Statistics 45


3.1. Linear Rank Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2. U Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3. Hoeffding’s Combinatorial Central Limit Theorem . . . . . . . . . . . . . 55
3.4. Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

II. Nonparametric Statistical Methods 61


4. Smooth Goodness of Fit Tests 63
4.1. Motivation for the Smooth Model . . . . . . . . . . . . . . . . . . . . . . 64
4.2. Neyman’s Smooth Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2.1. Smooth Models for Discrete Distributions . . . . . . . . . . . . . 72
4.2.2. Smooth Models for Composite Hypotheses . . . . . . . . . . . . . 74
4.3. Smooth Models for Categorical Data . . . . . . . . . . . . . . . . . . . . 76


4.4. Smooth Models for Ranking Data . . . . . . . . . . . . . . . . . . . . . . 79


4.4.1. Distance-Based Models . . . . . . . . . . . . . . . . . . . . . . . . 79
4.4.2. φ-Component Models . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.4.3. Cyclic Structure Models . . . . . . . . . . . . . . . . . . . . . . . 86
4.5. Goodness of Fit Tests for Two-Way Contingency Tables . . . . . . . . . . 87
4.6. Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

5. One-Sample and Two-Sample Problems 91


5.1. Sign Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.1.1. Confidence Interval for the Median . . . . . . . . . . . . . . . . . 94
5.1.2. Power Comparison of Parametric and Nonparametric Tests . . . . 96
5.2. Wilcoxon Signed Rank Test . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.3. Two-Sample Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.3.1. Permutation Test . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.3.2. Mann-Whitney-Wilcoxon Rank-Sum Test . . . . . . . . . . . . . . 105
5.3.3. Confidence Interval and Hodges-Lehmann Estimate for the
Location Parameter Δ . . . . . . . . . . . . . . . . . . . . . . . . 110
5.3.4. Test for Equality of Scale Parameters . . . . . . . . . . . . . . . 111
5.4. Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

6. Multi-Sample Problems 117


6.1. A Unified Theory of Hypothesis Testing . . . . . . . . . . . . . . . . . . 117
6.1.1. A General Approach . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.1.2. The Multi-Sample Problem in the Ordered Case . . . . . . . . . . 119
6.1.3. The Multi-Sample Problem in the Unordered Case . . . . . . . . 122
6.1.4. Tests for Umbrella Alternatives . . . . . . . . . . . . . . . . . . . 124
6.2. Test for Dispersion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.3. Tests of the Equality of Several Independent Samples . . . . . . . . . . . 127
6.4. Tests for Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.4.1. Tests for Interaction: More Than One Observation per Cell . . . 131
6.5. Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

7. Tests for Trend and Association 137


7.1. Tests for Trend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.2. Problems of Concordance . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.2.1. Application Using Spearman Scores . . . . . . . . . . . . . . . . . 141
7.2.2. Application Using Kendall Scores . . . . . . . . . . . . . . . . . . 142
7.2.3. Application Using Hamming Scores . . . . . . . . . . . . . . . . . 144
7.3. The Two-Sample Ranking Problem . . . . . . . . . . . . . . . . . . . . . 145


7.4. The Use of Penalized Likelihood in Tests of Concordance: One and


Two Group Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.5. Design Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
7.5.1. Compatibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
7.5.2. General Block Designs . . . . . . . . . . . . . . . . . . . . . . . . 154
7.6. Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

8. Optimal Rank Tests 163


8.1. Locally Most Powerful Rank-Based Tests . . . . . . . . . . . . . . . . . 164
8.2. Regression Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
8.3. Optimal Tests for the Method of n-Rankings . . . . . . . . . . . . . . . . 174
8.3.1. Complete Block Designs . . . . . . . . . . . . . . . . . . . . . . . 175
8.3.2. Incomplete Block Designs . . . . . . . . . . . . . . . . . . . . . . 176
8.3.3. Special Cases: Wilcoxon Scores . . . . . . . . . . . . . . . . . . . 180
8.3.4. Other Score Functions . . . . . . . . . . . . . . . . . . . . . . . . 180
8.3.5. Asymptotic Distribution Under the Null Hypothesis . . . . . . . . 182
8.3.6. Asymptotic Distribution Under the Alternative . . . . . . . . . . 183
8.4. Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

9. Efficiency 187
9.1. Pitman Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
9.2. Making Use of Le Cam’s Lemmas . . . . . . . . . . . . . . . . . . . . . . 193
9.2.1. Asymptotic Distributions Under the Alternative in the General
Multi-Sample Problem: The Spearman Case . . . . . . . . . . . . 195
9.2.2. Asymptotic Distributions Under the Alternative in the General
Multi-Sample Problem: The Hamming Case . . . . . . . . . . . . 199
9.3. Asymptotic Efficiency in the Unordered Multi-Sample Test . . . . . . . . 203
9.4. Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205

III. Selected Applications 207


10. Multiple Change-Point Problems 209
10.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
10.2. Parametric Formulation for Change-Point Problems . . . . . . . . . . . . 209
10.2.1. Estimating the Location of a Change-Point using a Composite
Likelihood Approach . . . . . . . . . . . . . . . . . . . . . . . . . 210
10.2.2. Estimation of Multiple Change-Points . . . . . . . . . . . . . . . . 212
10.2.3. Testing the Significance of a Change-Point . . . . . . . . . . . . . 213


10.3. Consistency of the Estimated Change-Point Locations . . . . . . . . . . . 214


10.3.1. The Case of Single Change-Point . . . . . . . . . . . . . . . . . . 214
10.3.2. The Case of Multiple Change-Points . . . . . . . . . . . . . . . . 215
10.4. Simulation Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
10.4.1. Model Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
10.4.2. Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . 218
10.5. Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
10.5.1. Array CGH Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
10.5.2. Interest Rate Data . . . . . . . . . . . . . . . . . . . . . . . . . . 223

11. Bayesian Models for Ranking Data 229


11.1. Maximum Likelihood Estimation (MLE) of Our Model . . . . . . . . . . 233
11.2. Bayesian Method with Conjugate Prior and Posterior . . . . . . . . . . . 233
11.3. Variational Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
11.3.1. Optimization of the Variational Distribution . . . . . . . . . . . . 235
11.3.2. Comparison of the True Posterior Distribution and Its
Approximation Obtained by Variational Inference . . . . . . . . . 237
11.3.3. Angle-Based Model for Incomplete Rankings . . . . . . . . . . . . 238
11.4. Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
11.4.1. Sushi Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
11.4.2. APA Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242

12. Analysis of Censored Data 245


12.1. Survival Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
12.1.1. Kaplan-Meier Estimator . . . . . . . . . . . . . . . . . . . . . . . 246
12.1.2. Locally Most Powerful Tests . . . . . . . . . . . . . . . . . . . . . 247
12.2. Local Asymptotic Normality, Hajek-Le Cam Theory, and Stein’s Least
Favorable Parametric Submodels . . . . . . . . . . . . . . . . . . . . . . 248
12.3. Parametric Embedding with Censored and Truncated Data . . . . . . . . 250
12.3.1. Extension of Parametric Embedding to Censored Data . . . . . . 250
12.3.2. From Gehan and Bhattacharya et al. to Hazard-Induced Rank
Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
12.3.3. Semi-parametric Efficiency via Least Favorable Parametric
Submodels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255

A Description of Data Sets 259


A.1. Sutton Leisure Time Data . . . . . . . . . . . . . . . . . . . . . . . . . . 259
A.2. Umbrella Alternative Data . . . . . . . . . . . . . . . . . . . . . . . . . 259
A.3. Song Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
A.4. Goldberg Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260


A.5. Sushi Data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260


A.6. APA Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
A.7. January Precipitation Data (in mm) for Saint John, New Brunswick,
Canada . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
A.8. Annual Temperature Data (in ◦ C) in Hong Kong . . . . . . . . . . . . . 263

Index 277

Notation

R the set of all real numbers


X = (X1 , . . . , Xn ) a random vector
i.i.d. independent and identically distributed
θ a parameter, possibly a vector
H0 null hypothesis
H1 alternative hypothesis
fX (x; θ) the probability density function of X
FX (x; θ) the cumulative distribution function of X
cdf cumulative distribution function
pdf probability density function
Φ (x) the cumulative distribution of a standard normal random variable
Nk (μ, Σ) the k dimensional multivariate normal distribution with mean μ
and variance-covariance Σ
vMF(x|m, κ) von Mises-Fisher distribution
π (x; θ) a smooth alternative density of X
U (θ; X) score vector
U (X) score vector evaluated under the null hypothesis
R1 , . . . , R n the ranks of X1 , . . . , Xn from the smallest to the largest
an (R) score function evaluated at rank R
L (θ) or L (θ; x) the likelihood function of θ based on X1 , . . . , Xn
l (θ) or l (θ; x) the log of the likelihood function of θ based on X1 , . . . , Xn
In (θ) the Fisher information function of θ based on X1 , . . . , Xn
X̄n the sample mean of X1 , . . . , Xn
Sn2 the sample variance of X1 , . . . , Xn
$X_n \xrightarrow{P} X$ the sequence of random variables Xn converges in probability to X
$X_n \xrightarrow{L} X$ the sequence of random variables Xn converges in distribution to X
$X_n \xrightarrow{a.s.} X$ the sequence of random variables Xn converges almost surely to X
E [X] mean of X
Var(X) variance of X
CLT central limit theorem


LAN local asymptotic normality


MLE maximum likelihood estimation
T1 ∼ T2 the statistics T1 and T2 are asymptotically equivalent
χ2r (δ) the chi-squared distribution with d.f. r and noncentrality parameter δ
ANOVA the analysis of variance
VI variational inference
MCMC Markov Chain Monte Carlo
SIR sampling-importance-resampling
ELBO evidence lower bound
LTRC left-truncated right-censored

Part I.

Introduction and Fundamentals

1. Introduction
This book grew out of a desire to bridge the gap between parametric and nonparametric
statistics and to exploit the best aspects of the former while enjoying the robustness
properties of the latter. Parametric statistics is a well-established field which incorpo-
rates the important notions of likelihood and sufficiency that are part of estimation and
testing. Likelihood methods have been used to construct efficient estimators, confidence
intervals, and tests with good power properties. They have also been used to incorporate
incomplete data and to pool information from different sources collected under different
sampling schemes. As well, Bayesian methods which rely on the likelihood function
can be used to combine information acquired through a prior distribution. Constraints
which restrict the domain of the likelihood function can also be taken into account.
Likelihood functions are Bartlett correctable which helps to improve the accuracy of
the inference. Additional tools such as penalized likelihood via Akaike or Bayesian
information criterion can take into account constraints on the parameters. Recently,
the notion of composite likelihood has been introduced to extend the range of appli-
cations. Problems of model selection are naturally dealt with through the likelihood
function.
A difficulty that arises with parametric inference is that we need to know the
underlying distribution of the random variable up to some parameters. If that dis-
tribution is misspecified, then inferences based on the likelihood may be inefficient
and confidence intervals and tests may not lead to correct conclusions. Hence, when
there is uncertainty as to the exact nature of the distribution, one may alternatively
make use of traditional nonparametric methods which avoid distributional assump-
tions. Such methods have proven to be very efficient in several instances although
their power is generally less than the analogous parametric counterparts. Moreover,
nonparametric statistics rely more on intuition and the subject has developed in a
nonsystematic manner, always in an attempt to mimic parametric statistics. Boot-
strap methods could also be used in most cases to provide estimates and to con-
struct confidence intervals but their interpretation may not be easy. For example, the
shape of a confidence region for a vector parameter will often appear as a cloud in
space.
To act as a bridge between parametric and nonparametric statistics, Conover and
Iman (1981) used rank transformations in an ad hoc manner. They suggested using
parametric methods based on the ranks of the data in order to conduct nonparametric


analyses. However, as mentioned by Conover and Iman (1981), such an approach has a
number of limitations, for instance the severe lack of robustness for the test for equality
of variances.
In a landmark paper, Neyman (1937) considered the nonparametric goodness of fit
problem and introduced the notion of smooth tests of fit by proposing a parametric
family of alternative densities to the null hypothesis. The type of embedding proposed
by Neyman was further elaborated by Rayner et al. (2009a) in connection with good-
ness of fit problems. In this book, we propose an embedding which focuses on local
properties more in line with the notion of exponential tilting. Hence, we obtain a new
derivation of the well-known Friedman statistic as the locally most powerful test in an
embedded family of distributions. In another direction, we exploit Hoeffding’s change
of measure formula which provides an approach to obtaining locally most powerful tests
based on ranks for various multi-sample problems. This is then followed by applications
of Le Cam’s three lemmas in order to obtain the asymptotic distribution of various statis-
tics under the alternative. Together, these results enable us to determine the asymptotic
relative efficiency of our test statistics.
This book is divided into three parts. In Part I, we outline briefly fundamental
concepts in probability and statistics. We introduce the reader to some of the important
tools in nonparametric statistics such as U statistics and linear rank statistics. In Part II,
we describe Neyman’s smooth tests in connection with goodness of fit problems and we
obtain test statistics for some common nonparametric problems. We then proceed to
make use of this concept in connection with the usual one- and two-sample tests. In
Chapter 6, we present a unified theory of hypothesis testing and apply it to study multi-
sample problems of location. We illustrate the theory in the case of the multi-sample
location problem as well as the problem involving umbrella alternatives. In Chapter 7,
we obtain a new derivation of the Friedman statistic and show it is locally most powerful.
We then make use of penalized likelihood to gain further insight into the rankings selected
by the sample. Chapter 8 deals with locally most powerful tests, whereas Chapter 9 is
devoted to the concept of efficiency. In Part III, we consider some modern applications
of nonparametric statistics. Specifically, we couch the multiple change-point problem
within the context of a smooth alternative. Next, we propose a new Bayesian approach
to the study of ranking problems. We conclude with Chapter 12 wherein we briefly
describe the application of the methodology to the analysis of censored data.

2. Fundamental Concepts
in Parametric Inference
In this chapter we review some terminology and basic concepts in probability and
classical statistical inference which provide the notation and fundamental background
to be used throughout this book. In the section on probability we describe some basic
notions and list some common distributions along with their mean, variance, skewness,
and kurtosis. We also describe various modes of convergence and end with central limit
theorems. In the section on statistical inference, we begin with the subjects of estima-
tion and hypothesis testing and proceed with the notions of contiguity and composite
likelihood.

2.1. Concepts in Probability


2.1.1. Random Variables and Probability Functions
The study of probability theory is based on the concept of random variables. The sample
space is the set of all possible outcomes of an experiment. A real valued random variable
X is defined as a function from the sample space to the set of real numbers. As an
example, suppose in an experiment of rolling two independent dice, X represents the
sum of the face values of the two dice. Then the sample space consists of the ordered
pairs {(i, j) : i, j = 1, 2, . . . , 6} and the range of X is {2, 3, 4, . . . , 12}.
The cumulative distribution function (cdf) of a random variable X is the probability
that X takes on a value less than or equal to x, denoted by FX (x) = P (X ≤ x). Suppose
30% of students have heights less than or equal to 150 cm. Then the probability that
the height X of a randomly chosen student is less than or equal to 150 cm is 0.3 which
can be expressed as
FX (150) = P (X ≤ 150) = 0.3.
The cdf FX (x) in general is a nondecreasing right continuous function of x, which
satisfies
$$\lim_{x\to-\infty} F_X(x) = 0, \qquad \lim_{x\to\infty} F_X(x) = 1.$$


Most random variables are either discrete or continuous. We say that a random
variable is continuous if its cdf is a continuous function having no jumps. A continuous
random variable, such as weight, length, or lifetime, takes any numerical value in an
interval or on the positive real line. Typically, a continuous cdf has a derivative except
at some points. This derivative, denoted by
$$f_X(x) = \frac{d}{dx}F_X(x) = F_X'(x),$$
is called the probability density function (pdf ) of X. The cdf of a continuous random
variable X on the entire real line satisfies
$$f_X(x) \geq 0, \qquad \int_{-\infty}^{\infty} f_X(t)\,dt = 1, \qquad \text{and} \qquad F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt.$$

A random variable is called discrete if it takes a finite or a countably infinite number


of values. Examples are the result of a soccer game (win, lose, or draw), the number of
sunny days in a week, or the number of patients infected with a disease. The probability
mass function (pmf) of a discrete random variable is defined as fX (x) = P (X = x). The
pmf for a discrete random variable X, fX (x) should satisfy the following conditions:

$$f_X(x) \geq 0 \quad \text{and} \quad \sum_{x\in\Omega} f_X(x) = 1,$$

where Ω is the range of X.
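These conditions are easy to check numerically. The following minimal Python sketch (our illustration, not from the book) constructs the pmf of the dice-sum example above and verifies both conditions, together with the cdf:

```python
# pmf of X = the sum of two fair dice, over its range Omega = {2, ..., 12}
outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]
pmf = {x: sum(1 for (i, j) in outcomes if i + j == x) / 36.0
       for x in range(2, 13)}

assert all(p >= 0 for p in pmf.values())       # f_X(x) >= 0
assert abs(sum(pmf.values()) - 1.0) < 1e-12    # sum over Omega equals 1

# cdf F_X(x) = P(X <= x), a nondecreasing step function
cdf = {x: sum(p for t, p in pmf.items() if t <= x) for x in range(2, 13)}
print(pmf[7], cdf[7])   # 6/36 and 21/36
```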


There are occasions when one is interested in the conditional density of a random
variable X given a random variable Y. This is defined as

$$f(x \mid y) = \frac{f(x, y)}{f_Y(y)}, \qquad f_Y(y) > 0.$$

The joint cdf for a vector of p (≥ 1) random variables, X = (X1 , X2 , . . . , Xp ), is


defined as
FX (x1 , x2 , . . . , xp ) = P (X1 ≤ x1 , . . . , Xp ≤ xp ) .
The joint density can be defined as

$$f_X(x_1, x_2, \ldots, x_p) = \frac{\partial^p F_X(x_1, x_2, \ldots, x_p)}{\partial x_1 \cdots \partial x_p},$$

provided the multiple derivatives exist. The p random variables X1 , X2 , . . . , Xp are said
to be independent if their joint pdf is the product of the p individual densities, labeled
marginal pdfs and denoted {fXi (x)}, i.e.,

fX (x1 , x2 , . . . , xp ) = fX1 (x1 )fX2 (x2 ) · · · fXp (xp ), for all x1 , . . . , xp .


Definition 2.1. A random sample of size p of the random variable X is a set of inde-
pendent and identically distributed (i.i.d.) random variables X1 , X2 , . . . , Xp , with the
same pdf as X.
Definition 2.2. The order statistics from a sample of random variables X1 , X2 , . . . , Xp
are denoted X(1) ≤ X(2) ≤ . . . ≤ X(p) and indicate which are the smallest, second
smallest, etc.
Hence, X(1) = min {X1 , X2 , . . . , Xp } and X(p) = max {X1 , X2 , . . . , Xp } .
In probability and statistics, we are interested in properties of the distribution of a
random variable. The expected value of a function g(X) of a real valued random variable
X, denoted by E[g(X)], is defined as
$$E[g(X)] = \begin{cases} \int_{-\infty}^{\infty} g(x)\, f(x)\, dx & \text{if } X \text{ is continuous} \\ \sum_{x\in\Omega} g(x)\, f(x) & \text{if } X \text{ is discrete.} \end{cases}$$

Similarly, the conditional expectation of a function g (X) given (Y = y) is defined as


$$E[g(X) \mid Y = y] = \begin{cases} \int_{-\infty}^{\infty} g(x)\, f(x \mid y)\, dx & \text{if } X \text{ is continuous} \\ \sum_{x\in\Omega} g(x)\, f(x \mid y) & \text{if } X \text{ is discrete.} \end{cases}$$

We encounter the conditional expectation in regression problems.


The nth moment of a random variable X is E [X n ] and the nth central moment
about its mean is μn = E[(X − μ)n ], where μ = E [X], the mean of X. Note that μ1 = 0
and μ2 = V ar(X) is the variance of X, usually denoted by σ 2 .
Definition 2.3. The moment generating function of a random variable X, if it exists, is defined to be
$$M_X(t) = E\left[e^{tX}\right]$$
for t in a neighborhood of 0. The moment generating function, as the name suggests, can be used in part to obtain the moments of a distribution by differentiating it:
$$M_X^{(k)}(t)\Big|_{t=0} = E\left[X^k\right].$$

As an example, the moment generating function of a normally distributed random variable with mean μ and variance σ² is given by
$$M_X(t) = e^{\mu t + \frac{1}{2}\sigma^2 t^2}.$$

Another important property is that the moment generating function when it exists is
unique. The moment generating function of the sum of independent random variables is
equal to the product of the individual moment generating functions. This fact, coupled
with the uniqueness property helps to determine the distribution of the sum of the
variables.
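For instance, differentiating the normal moment generating function recovers the first two moments:

$$M_X'(t) = \left(\mu + \sigma^2 t\right) e^{\mu t + \frac{1}{2}\sigma^2 t^2}, \qquad M_X'(0) = \mu = E[X],$$
$$M_X''(t) = \left[\sigma^2 + \left(\mu + \sigma^2 t\right)^2\right] e^{\mu t + \frac{1}{2}\sigma^2 t^2}, \qquad M_X''(0) = \sigma^2 + \mu^2 = E\left[X^2\right],$$

so that $Var(X) = E[X^2] - (E[X])^2 = \sigma^2$, as expected.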


The third and fourth central moments are used to define the skewness and (excess)
kurtosis as
γ = μ3 /σ 3 and κ = μ4 /σ 4 − 3
respectively. The skewness measures the “slant” of a distribution with γ = 0 for a
symmetric distribution. When γ < 0, the distribution is slanted to the left (with a
long tail on the left) and when γ > 0, it is slanted to the right (with a long tail on
the right). The kurtosis measures the “fatness” of the tails of a distribution. A positive
value indicates a heavier tail than that of a normal distribution whereas a negative value
points to a lighter tail.
Knowledge of the mean, variance, skewness, and kurtosis can often be used to ap-
proximate fairly well a given distribution (see Kendall and Stuart (1979)). Table 2.1
exhibits some important pmfs/pdfs along with their mean, variance, skewness, and
(excess) kurtosis. The multinomial distribution generalizes the binomial for the case
of r categories.
Using the linear properties of the expectation operator, we can determine the ex-
pected value and moments of a linear combination of a set of random variables. For
instance, for p random variables X1 , . . . , Xp and constants ai , i = 1, . . . , p, we have,


$$E\left[\sum_{i=1}^{p} a_i X_i\right] = \sum_{i=1}^{p} a_i\, E[X_i],$$
$$Var\left(\sum_{i=1}^{p} a_i X_i\right) = \sum_{i=1}^{p} a_i^2\, Var(X_i) + 2 \sum_{i=1}^{p} \sum_{j=i+1}^{p} a_i a_j\, Cov(X_i, X_j),$$
where
$$Cov(X_i, X_j) = E[X_i X_j] - E[X_i]\, E[X_j].$$
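These identities lend themselves to a quick Monte Carlo check; in the Python sketch below the weights a and the covariance matrix are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)

# two correlated variables and the linear combination a1*X1 + a2*X2
a = np.array([2.0, -1.0])
cov = np.array([[1.0, 0.3],
                [0.3, 2.0]])
X = rng.multivariate_normal(mean=[1.0, 4.0], cov=cov, size=200_000)

# Var(a1 X1 + a2 X2) = a1^2 Var(X1) + a2^2 Var(X2) + 2 a1 a2 Cov(X1, X2)
theory = a[0]**2 * cov[0, 0] + a[1]**2 * cov[1, 1] + 2 * a[0] * a[1] * cov[0, 1]
empirical = (X @ a).var()
print(theory, empirical)   # 4.8 versus a Monte Carlo estimate close to 4.8
```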
Some useful results in connection with conditional distributions are in the following
theorem.
Theorem 2.1. Let X and Y be two random variables defined on the same probability
space. Then
(a) E [Y ] = E [E [Y |X]],

(b) V ar [Y ] = E [V ar(Y |X)] + V ar(E [Y |X]).


Example 2.1 (Symbolic Data Analysis). In symbolic data analysis, it is common to
observe a random sample in the form of intervals {(ai , bi ) , i = 1, . . . , n}. Let X be a
random variable whose conditional distribution is uniform on the interval (A, B). In
that case, the conditional mean and variance of X are respectively

$$\frac{A + B}{2} \quad \text{and} \quad \frac{(B - A)^2}{12}.$$

Table 2.1.: Some important discrete and continuous random variables (kurtosis values are excess kurtosis)

Discrete

Uniform on $1, 2, \ldots, m$: pmf $\frac{1}{m}$; mean $\frac{m+1}{2}$; variance $\frac{m^2-1}{12}$; skewness $0$; kurtosis $-\frac{6(m^2+1)}{5(m^2-1)}$

Binomial: pmf $\binom{n}{x}\, p^x (1-p)^{n-x}$, $x = 0, 1, \ldots, n$, $0 \leq p \leq 1$; mean $np$; variance $np(1-p)$; skewness $\frac{1-2p}{\sqrt{np(1-p)}}$; kurtosis $\frac{1-6p(1-p)}{np(1-p)}$

Multinomial: pmf $\frac{n!}{x_1!\, x_2! \cdots x_r!}\, p_1^{x_1} p_2^{x_2} \cdots p_r^{x_r}$, $\sum_{i=1}^{r} x_i = n$, $\sum_{i=1}^{r} p_i = 1$

Continuous

Uniform on $(a, b)$: pdf $\frac{1}{b-a}$, $a < x < b$; mean $\frac{a+b}{2}$; variance $\frac{(b-a)^2}{12}$; skewness $0$; kurtosis $-\frac{6}{5}$

Normal: pdf $\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$, $-\infty < x, \mu < \infty$, $\sigma > 0$; mean $\mu$; variance $\sigma^2$; skewness $0$; kurtosis $0$

Exponential: pdf $\lambda e^{-\lambda x}$, $x \geq 0$, $\lambda > 0$; mean $\frac{1}{\lambda}$; variance $\frac{1}{\lambda^2}$; skewness $2$; kurtosis $6$

Gamma: pdf $\frac{\beta^\alpha}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x}$, $x \geq 0$, $\alpha, \beta > 0$; mean $\frac{\alpha}{\beta}$; variance $\frac{\alpha}{\beta^2}$; skewness $\frac{2}{\sqrt{\alpha}}$; kurtosis $\frac{6}{\alpha}$

Chi-square: pdf $\frac{1}{2^{m/2}\Gamma(m/2)}\, x^{m/2-1} e^{-x/2}$, $x \geq 0$, $m > 0$; mean $m$; variance $2m$; skewness $\sqrt{\frac{8}{m}}$; kurtosis $\frac{12}{m}$

Laplace: pdf $\frac{1}{2\sigma} \exp\left(-\frac{|x-\mu|}{\sigma}\right)$, $-\infty < x, \mu < \infty$, $\sigma > 0$; mean $\mu$; variance $2\sigma^2$; skewness $0$; kurtosis $3$

Logistic: pdf $\frac{\exp\left(-\frac{x-\mu}{\sigma}\right)}{\sigma\left(1+\exp\left(-\frac{x-\mu}{\sigma}\right)\right)^2}$, $-\infty < x, \mu < \infty$, $\sigma > 0$; mean $\mu$; variance $\frac{\pi^2\sigma^2}{3}$; skewness $0$; kurtosis $1.2$

Student's t: pdf $\frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\pi\nu}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1+\frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}$, $-\infty < x < \infty$, $\nu > 0$; mean $0$ ($\nu > 1$); variance $\frac{\nu}{\nu-2}$ ($\nu > 2$); skewness $0$; kurtosis $\frac{6}{\nu-4}$ ($\nu > 4$)

From Theorem 2.1, the unconditional mean is given by
$$E\left[\frac{A+B}{2}\right],$$
which can be estimated by
$$\frac{\sum (a_i + b_i)}{2n}.$$
The unconditional variance is given by
$$E\left[Var\left[X \mid A, B\right]\right] + Var\left(E\left[X \mid A, B\right]\right) = E\left[\frac{(B-A)^2}{12}\right] + Var\left(\frac{A+B}{2}\right),$$
which can be estimated unbiasedly by
$$\frac{\sum (b_i - a_i)^2}{12n} + \frac{1}{4(n-1)}\left[\sum (a_i + b_i)^2 - \frac{1}{n}\left(\sum (a_i + b_i)\right)^2\right].$$
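A short Python sketch of these two estimators (the helper name interval_mean_var and the toy intervals below are ours, purely for illustration):

```python
import numpy as np

def interval_mean_var(a, b):
    """Moment estimates for interval-valued (symbolic) data as in Example 2.1,
    where X | (A, B) is uniform on (A, B) and {(a_i, b_i)} is an i.i.d. sample."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n = len(a)
    s = a + b
    mean_hat = s.sum() / (2 * n)                     # estimates E[(A + B)/2]
    var_hat = ((b - a) ** 2).sum() / (12 * n) \
        + ((s ** 2).sum() - s.sum() ** 2 / n) / (4 * (n - 1))
    return mean_hat, var_hat

# toy intervals, purely illustrative
print(interval_mean_var([0.0, 1.0, 2.0], [2.0, 3.0, 5.0]))
```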

We shall also need to make use of the distribution of the order statistics which is
given in the lemma below.

Lemma 2.1. Let {X1 , X2 , . . . , Xn } be a random sample of size n from a cumulative


distribution F (x) with density f (x) . Then the density of the ith order statistic is given by

$$f_{X_{(i)}}(x) = \frac{n!}{(i-1)!\,(n-i)!}\, [F(x)]^{i-1}\, f(x)\, [1 - F(x)]^{n-i}, \qquad i = 1, \ldots, n. \tag{2.1}$$

Proof. An intuitive proof may be given by using the multinomial distribution. The event that $X_{(i)}$ lies in a small interval around x requires that (i − 1) observations fall to the left of x, one falls in the small interval, and (n − i) fall to the right of x; the multinomial coefficient counts the possible arrangements.
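For a uniform parent on (0, 1), formula (2.1) reduces to the Beta(i, n − i + 1) density, which the following illustrative simulation confirms:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, i = 5, 2   # density of the 2nd smallest of n = 5 uniform observations

# with F(x) = x on (0, 1), (2.1) is the Beta(i, n - i + 1) density
samples = np.sort(rng.uniform(size=(100_000, n)), axis=1)[:, i - 1]
x = 0.3
print(stats.beta(i, n - i + 1).pdf(x))   # exact density from (2.1)
print(stats.gaussian_kde(samples)(x))    # Monte Carlo estimate at the same point
```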

2.1.2. Modes of Convergence and Central Limit Theorems


We shall be concerned with three basic modes of convergence: weak convergence (equiv-
alently, convergence in distribution), convergence in probability, and convergence almost
surely. An important tool in proving limit results is the Borel-Cantelli lemma which pro-
vides conditions under which an event will have probability zero. We shall have occasion
to make use of the Borel-Cantelli lemma in Chapter 10 to prove consistency results.

Lemma 2.2 (Borel-Cantelli). Let A1, A2, . . . be a sequence of events in some probability space such that
$$\sum_{i=1}^{\infty} P(A_i) < \infty.$$


Then
$$P\left(\limsup_{n\to\infty} A_n\right) = P\left(\bigcap_{n=1}^{\infty} \bigcup_{k\geq n} A_k\right) = 0.$$

The notation lim supn−→∞ An indicates the set of outcomes which occur infinitely often
in the sequence of events.

We shall say that a sequence of random variables $X_n$ converges weakly or converges in distribution to X, denoted $X_n \xrightarrow{L} X$, if as n → ∞,
$$P(X_n \leq x) \to P(X \leq x)$$
for all points of continuity x of the cdf of X.

We shall say that a sequence of random variables $X_n$ converges in probability to X, denoted $X_n \xrightarrow{P} X$, if as n → ∞,
$$P(|X_n - X| > \varepsilon) \to 0$$
for every ε > 0.

We shall say that a sequence of random variables $X_n$ converges almost surely to X, denoted $X_n \xrightarrow{a.s.} X$, if as n → ∞,
$$P\left(\lim_{n\to\infty} |X_n - X| > \varepsilon\right) = 0$$
for every ε > 0.
Convergence almost surely implies convergence in probability. On the other hand, if
a sequence of random variables converges in probability, then there exists a subsequence
which converges almost surely. As well, convergence in probability implies convergence
in distribution. The following inequality plays a useful role in probability and statistics.

Lemma 2.3 (Chebyshev Inequality). Let X be a random variable with mean μ and
finite variance σ 2 . Then for ε > 0,

$$P(|X - \mu| \geq \varepsilon) \leq \frac{\sigma^2}{\varepsilon^2}.$$
As an application of Chebyshev’s inequality, suppose that X1 , . . . , Xn is a sequence of
independent identically distributed random variables having mean μ and finite variance
σ 2 . Then, for ε > 0,
$$P\left(\left|\bar{X}_n - \mu\right| > \varepsilon\right) \leq \frac{\sigma^2}{n\varepsilon^2},$$
from which we conclude that $\bar{X}_n \xrightarrow{P} \mu$ as n → ∞. This is known as the Weak Law of Large Numbers.
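The following Python sketch (our illustration, with exponentially distributed data chosen arbitrarily) compares the empirical probability $P(|\bar{X}_n - \mu| > \varepsilon)$ with the Chebyshev bound $\sigma^2/(n\varepsilon^2)$:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2, eps = 1.0, 1.0, 0.1   # exponential with scale 1: mean 1, variance 1

# empirical P(|X_bar_n - mu| > eps) versus the Chebyshev bound sigma^2 / (n eps^2)
for n in (100, 1_000, 10_000):
    xbar = rng.exponential(scale=mu, size=(2_000, n)).mean(axis=1)
    print(n, (np.abs(xbar - mu) > eps).mean(), sigma2 / (n * eps**2))
```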


There is as well the Strong Law of Large Numbers (Billingsley (2012), Section 22, p.
301) which states that if X1, X2, . . . is a sequence of independent identically distributed random variables with mean μ for which E|Xi| < ∞, then for every ε > 0,
$$P\left(\lim_{n\to\infty} \left|\bar{X}_n - \mu\right| > \varepsilon\right) = 0.$$
That is, $\bar{X}_n \xrightarrow{a.s.} \mu$. The single most important result in both probability and statistics
is the central limit theorem (CLT). In its simplest version, it states that if {X1 , . . . , Xn }
is a sequence of independent identically distributed (i.i.d.) random variables with mean
μ and finite variance σ 2 , then for large enough n,

$$\frac{\sqrt{n}\left(\bar{X}_n - \mu\right)}{\sigma} \xrightarrow{L} N(0, 1),$$
where N (0, 1) is the standard normal distribution with mean 0 and variance 1. Since the
assumptions underlying the CLT are weak, there have been countless applications as in
for example approximations to the probabilities of various events involving the sample
mean. As well, it has been used to approximate various discrete distributions such as
the binomial, Poisson, and negative binomial. An important companion result is due to
Slutsky (Casella and Berger (2002), p. 239) which can often be used together with the
central limit theorem.
Theorem 2.2 (Slutsky's Theorem). Suppose that $X_n \xrightarrow{L} X$ and $Y_n \xrightarrow{P} c$ for a constant c. Then,
$$X_n Y_n \xrightarrow{L} cX$$
and
$$X_n + Y_n \xrightarrow{L} X + c.$$
Moreover, if c ≠ 0,
$$X_n / Y_n \xrightarrow{L} X / c.$$
A direct application of Slutsky’s theorem is as follows. Let {X1 , . . . , Xn } be a se-
quence of i.i.d. random variables having finite variance σ 2 , n > 1 and let
$$\bar{X}_n = \frac{\sum_{i=1}^{n} X_i}{n}, \qquad S_n^2 = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}_n\right)^2}{n-1}$$
be the sample mean and sample variance respectively. Then, it can be shown that
$$Var\left(S_n^2\right) = \frac{1}{n}\left(\mu_4 - \frac{n-3}{n-1}\,\sigma^4\right) = \frac{\sigma^4}{n}\left(\kappa + \frac{2n}{n-1}\right),$$


where $\mu_4 = E[X - \mu]^4$ and $\kappa = \mu_4/\sigma^4 - 3$. It follows from Chebyshev's inequality that
$$S_n^2 \xrightarrow{P} \sigma^2,$$
and hence, from Slutsky's theorem,
$$\frac{\sqrt{n}\left(\bar{X}_n - \mu\right)}{S_n} = \frac{\sqrt{n}\left(\bar{X}_n - \mu\right)}{\sigma} \cdot \frac{\sigma}{S_n} \xrightarrow{L} N(0, 1).$$
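A short simulation (ours, with deliberately non-normal exponential data) illustrates this studentized form of the CLT:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, mu = 50, 2.0   # exponential data with mean 2: skewed, far from normal

# sqrt(n) (X_bar - mu) / S_n is asymptotically N(0, 1) by the CLT and Slutsky
x = rng.exponential(scale=mu, size=(100_000, n))
t = np.sqrt(n) * (x.mean(axis=1) - mu) / x.std(axis=1, ddof=1)
print((t <= 1.645).mean(), stats.norm.cdf(1.645))   # both close to 0.95
```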

An additional result known as the Delta method enables us to extend the CLT to
functions (Casella and Berger (2002), p. 243).

Theorem 2.3 (Delta Method). Let {Yn } be a sequence of random variables having mean
θ and finite variance σ 2 and for which
$$\sqrt{n}\,(Y_n - \theta) \xrightarrow{L} N\left(0, \sigma^2\right), \qquad \text{as } n \to \infty.$$
Let g be a differentiable real valued function such that $g'(\theta) \neq 0$. Then,
$$\sqrt{n}\,\left(g(Y_n) - g(\theta)\right) \xrightarrow{L} N\left(0,\, \sigma^2\left[g'(\theta)\right]^2\right).$$

The proof can be obtained by applying the first order Taylor expansion to g (Yn ).

Example 2.2. Let {X1, . . . , Xn} be a random sample from the Bernoulli distribution for which $X_i = 1$ with probability θ and $X_i = 0$ with probability 1 − θ. Let
$$g(\theta) = \theta(1 - \theta).$$
The CLT asserts that $\frac{\sqrt{n}\left(\bar{X}_n - \theta\right)}{\sqrt{\theta(1-\theta)}} \xrightarrow{L} N(0, 1)$ as n → ∞. Then, using the Delta method, the asymptotic distribution of $g\left(\bar{X}_n\right)$ is
$$\sqrt{n}\,\left(g\left(\bar{X}_n\right) - g(\theta)\right) \xrightarrow{L} N\left(0,\, \theta(1-\theta)\left[g'(\theta)\right]^2\right),$$
provided $g'(\theta) \neq 0$. We note, however, that at $\theta = \frac{1}{2}$, $g'\left(\frac{1}{2}\right) = 0$. In order to determine the asymptotic distribution of $g\left(\bar{X}_n\right)$ at $\theta = \frac{1}{2}$, we may proceed by using a second-order Taylor expansion:
$$g\left(\bar{X}_n\right) = g\left(\tfrac{1}{2}\right) + g'\left(\tfrac{1}{2}\right)\left(\bar{X}_n - \tfrac{1}{2}\right) + \tfrac{1}{2}\,g''\left(\tfrac{1}{2}\right)\left(\bar{X}_n - \tfrac{1}{2}\right)^2 = \frac{1}{4} + 0 - \left(\bar{X}_n - \tfrac{1}{2}\right)^2.$$


This implies that
$$4n\left(\frac{1}{4} - g\left(\bar{X}_n\right)\right) \xrightarrow{L} \chi_1^2, \qquad \text{as } n \to \infty.$$
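The chi-square limit at θ = 1/2 is easy to check by simulation; the following sketch (ours, with arbitrary n and number of replications) does so:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, theta = 1_000, 0.5
pbar = rng.binomial(n, theta, size=100_000) / n   # X_bar for each replication

# at theta = 1/2, 4n (1/4 - g(X_bar)) with g(p) = p (1 - p) is asymptotically chi^2_1
w = 4 * n * (0.25 - pbar * (1 - pbar))
print(w.mean(), (w > stats.chi2(1).ppf(0.95)).mean())   # roughly 1 and 0.05
```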
A weakening of the simple CLT to the case of independent but not necessarily
identically distributed random variables is given by the Lindeberg-Feller theorem.

Theorem 2.4 (Lindeberg-Feller). Suppose that {Xi } are independent random vari-
ables with means {μi}, finite variances {σi²}, and distribution functions {Fi}. Let $S_n = \sum_{i=1}^{n}(X_i - \mu_i)$, $B_n^2 = \sum_{i=1}^{n} \sigma_i^2$, and suppose that
$$\max_k \frac{\sigma_k^2}{B_n^2} \to 0 \qquad \text{as } n \to \infty.$$
Then for large enough n,
$$\frac{S_n}{B_n} \xrightarrow{L} N(0, 1)$$
if and only if for every ε > 0,
$$\frac{\sum_{i=1}^{n} E\left[(X_i - \mu_i)^2\, I\left(|X_i - \mu_i| > \varepsilon B_n\right)\right]}{B_n^2} \to 0 \qquad \text{as } n \to \infty. \tag{2.2}$$

The Lindeberg condition (2.2) is implied by the Lyapunov condition which states that
there exists a δ > 0 such that
$$\frac{\sum_{i=1}^{n} E\,|X_i - \mu_i|^{2+\delta}}{B_n^{2+\delta}} \to 0 \qquad \text{as } n \to \infty.$$

Remark 2.1. Note that the condition maxk σk2 /Bn2 → 0 as n → ∞ is not needed for the
proof of the “if” part as it can be derived from (2.2).

Example 2.3. (Lehmann (1975), p. 351) Let {Y1 , . . . , Yn } be a random sample from the
Bernoulli distribution for which $Y_i = 1$ with probability θ and $Y_i = 0$ with probability 1 − θ. Set $X_i = iY_i$. We would like to determine the asymptotic distribution of $\sum_{i=1}^{n} X_i$.
We note that μi = iθ and σi2 = i2 θ (1 − θ). Consequently,


$$B_n^2 = \sum_{i=1}^{n} \sigma_i^2 = \frac{n(n+1)(2n+1)}{6}\,\theta(1-\theta).$$

On the other hand, since



$$(X_i - \mu_i)^2 = \begin{cases} i^2(1-\theta)^2 & \text{with probability } \theta \\ i^2\theta^2 & \text{with probability } 1-\theta, \end{cases}$$


it follows that for all i ≤ n,
$$(X_i - \mu_i)^2 \leq n^2,$$
while $B_n^2$ grows like $n^3$, so that for sufficiently large n the indicator $I(|X_i - \mu_i| > \varepsilon B_n)$ vanishes for every i ≤ n and consequently,
$$\lim_{n\to\infty} \frac{\sum_{i=1}^{n} E\left[(X_i - \mu_i)^2\, I\left(|X_i - \mu_i| > \varepsilon B_n\right)\right]}{B_n^2} = 0.$$
Therefore, applying Theorem 2.4, $B_n^{-1}\left(\sum_{i=1}^{n} X_i - \theta n(n+1)/2\right) \xrightarrow{L} N(0, 1)$ for large n.
We note that in this example, the Lyapunov condition would not be satisfied.
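A simulation of Example 2.3 (an illustrative sketch of ours, not from the book) shows the normal approximation taking hold:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, theta = 200, 0.3
i = np.arange(1, n + 1)

# X_i = i Y_i with Y_i ~ Bernoulli(theta): independent, not identically distributed
y = rng.binomial(1, theta, size=(20_000, n))
s = (i * y).sum(axis=1) - theta * n * (n + 1) / 2        # S_n = sum (X_i - mu_i)
bn = np.sqrt(theta * (1 - theta) * n * (n + 1) * (2 * n + 1) / 6)
z = s / bn
print(z.std(), (z <= 1.96).mean(), stats.norm.cdf(1.96))  # ~1 and ~0.975
```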

2.2. Multivariate Central Limit Theorem


Definition 2.4. A random p-vector Y is said to have a multivariate normal distribution
with mean vector μ and non-singular variance-covariance matrix Σ if its density is
given by
$$f(y) = (2\pi)^{-p/2}\, |\Sigma|^{-1/2} \exp\left[-\frac{1}{2}\,(y - \mu)'\, \Sigma^{-1}\, (y - \mu)\right],$$
where |A| denotes the determinant of a square matrix A. We write Y ∼ Np (μ, Σ).
An important companion result is the Cramér-Wold device (Serfling, 2009) which
states that a random vector Y has a multivariate normal distribution if and only if every
linear combination $c'Y$ for $c \in \mathbb{R}^p$ has a univariate normal distribution.
Proposition 2.1. If Y ∼ Np (μ, Σ) and Z = AY +b for some q ×p matrix of constants
A of rank q ≤ p, and b is a constant q-vector, then

$$Z \sim N_q\left(A\mu + b,\ A\Sigma A'\right).$$

The multivariate generalization of the simple univariate central limit theorem is


given in the following theorem.
Theorem 2.5 (Multivariate central limit theorem). Let {Y 1 , . . . , Y n } be a random
sample from some p-variate distribution with mean μ and variance-covariance matrix
Σ. Let
$$\bar{Y}_n = \frac{1}{n}\sum_{i=1}^{n} Y_i$$
be the sample mean. Then, as n → ∞, the asymptotic distribution of $\sqrt{n}\left(\bar{Y}_n - \mu\right)$ is multivariate normal with mean 0 and variance-covariance matrix Σ, i.e.,
$$\sqrt{n}\left(\bar{Y}_n - \mu\right) \xrightarrow{L} N_p(0, \Sigma).$$


Corollary 2.1. Let T be an r×p matrix. Then


$$\sqrt{n}\left(T\bar{Y}_n - T\mu\right) \xrightarrow{L} N_r\left(0,\ T\Sigma T'\right).$$

Example 2.4 (Distribution of Quadratic Forms). We cite in this example some well-
known results on quadratic forms in normal variates. Let Y ∼ Np (μ, Σ) where Σ is
positive definite. Then,

(a) $AY \sim N\left(A\mu,\ A\Sigma A'\right)$ for a constant matrix A;

(b) $(Y - \mu)'\, \Sigma^{-1}\, (Y - \mu) \sim \chi_p^2$;

(c) $Y'AY \sim \chi_r^2(\delta)$ if and only if $A\Sigma A = A$, that is, $A\Sigma$ is idempotent and rank A = r. Here, A is a symmetric p × p nonnegative definite matrix of constants, $\chi_r^2(\delta)$ is the noncentral chi-square distribution, and $\delta = \mu' A \mu$ is the noncentrality parameter.
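Result (b) is straightforward to verify by simulation; in the sketch below the mean vector and covariance matrix are arbitrary choices of ours:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
p = 3
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])   # positive definite

# (Y - mu)' Sigma^{-1} (Y - mu) ~ chi^2_p for Y ~ N_p(mu, Sigma)
Y = rng.multivariate_normal(mu, Sigma, size=100_000)
d = Y - mu
q = np.einsum('ij,jk,ik->i', d, np.linalg.inv(Sigma), d)
print(q.mean(), (q > stats.chi2(p).ppf(0.95)).mean())   # ~p and ~0.05
```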

2.3. Concepts in Statistical Inference


Statistics is concerned in part with the subjects of estimation and hypothesis testing. In
estimation, one is interested in both the point-wise estimation and the computation of
confidence intervals for the parameters of a distribution function. In hypothesis testing
one aims to verify a hypothesis about a parameter. In the next section, we briefly review
some important concepts related to these two topics.

2.3.1. Parametric Estimation


An important part of statistics is the subject of estimation. On the one hand, one is
interested in estimating various parameters of a distribution function which characterizes
a certain physical phenomenon. For example, one may be interested in estimating the
average heights of individuals in a population. It is natural then to collect a random
sample of such individuals and to calculate an estimate based on that sample. On the
other hand, one is also interested in determining how accurate that estimate is. We
begin with the definition of the likelihood function which forms the basis of parametric
estimation.
Suppose that X1 , . . . , Xn are random variables having joint density f (x1 , . . . , xn ; θ),
θ ∈ Rp . The likelihood viewed as a function of θ is defined to be the joint density

L (θ) = f (x1 , . . . , xn ; θ) . (2.3)

Most statistical inference is concerned with using the sample to obtain knowledge about
the parameter θ. It sometimes happens that a function of the sample, labeled a statistic,
provides a summary of the data which is most relevant for this purpose.


Definition 2.5. We say that a statistic T (X1 , . . . , Xn ) is sufficient for θ if the condi-
tional density of X1 , . . . , Xn given T (X1 , . . . , Xn ) is independent of θ.

The factorization theorem characterizes this concept in the sense that T is sufficient
if and only if there exists a function h of t = T (x1 , . . . , xn ) and θ only and a function g
such that
f (x1 , . . . , xn ; θ) = h (t, θ) g (x1 , . . . , xn ) .
The concept of sufficiency allows us to focus attention on that function of the data which
contains all the important information on θ.
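For example, for a random sample from the Bernoulli(θ) distribution the factorization is immediate:

$$f(x_1, \ldots, x_n; \theta) = \prod_{i=1}^{n} \theta^{x_i}(1-\theta)^{1-x_i} = \underbrace{\theta^{t}\,(1-\theta)^{n-t}}_{h(t,\,\theta)} \cdot \underbrace{1}_{g(x_1, \ldots, x_n)}, \qquad t = \sum_{i=1}^{n} x_i,$$

so that $T(X_1, \ldots, X_n) = \sum_{i=1}^{n} X_i$ is sufficient for θ.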
There are some desirable properties of estimators which provide guidance on how
to choose among them. An estimator T (X1 , . . . , Xn ) is said to be unbiased for the
estimation of g (θ) if for all θ

Eθ [T (X1 , . . . , Xn )] = g (θ) .

Example 2.5. Let {X1 , . . . , Xn } be a random sample drawn from a distribution with
population mean μ and variance σ². It is easy to see from the properties of the expectation operator that the sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ is unbiased for μ:
$$E\left[\bar{X}_n\right] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}\sum_{i=1}^{n} \mu = \mu.$$
Also, the sample variance $S_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}_n\right)^2$ is unbiased for σ² since $S_n^2$ can be reexpressed as
$$S_n^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n}(X_i - \mu)^2 - n\left(\bar{X}_n - \mu\right)^2\right],$$
and
$$E\left[(X_i - \mu)^2\right] = Var(X_i) = \sigma^2, \qquad E\left[\left(\bar{X}_n - \mu\right)^2\right] = Var\left(\bar{X}_n\right) = \frac{\sigma^2}{n}.$$
A desirable property of an unbiased estimator is that it should have the smallest
variance
 
$$E_\theta\left[\left(T(X_1, \ldots, X_n) - g(\theta)\right)^2\right].$$
A further desirable property of estimators is that of consistency.

Definition 2.6. An estimator T (X1 , . . . , Xn ) of a parameter g (θ) is said to be consis-


tent if
$$T(X_1, \ldots, X_n) \xrightarrow{P} g(\theta), \qquad \text{as } n \to \infty.$$


Consistency is the minimal property required of an estimator. In order to illustrate


some of these concepts, we shall consider the exponential family of distributions.
Definition 2.7. The density of a real valued random variable X is said to be a member
of the exponential family if its density is of the form

f (x; θ) = h (x) exp [η (θ) t (x) − K (θ)] . (2.4)

where h (x) , η (θ) , t (x) , K (θ) are known functions. We will assume for simplicity that
η (θ) = θ.
It follows that
$$E_\theta[t(X)] = K'(\theta), \qquad Var_\theta(t(X)) = K''(\theta).$$

The density f(x; θ) is sometimes called the exponential tilting of h(x), with mean given by K′(θ) and variance by K″(θ). The exponential family includes as special cases several
of the commonly used distributions, such as the normal, exponential, binomial, and
Poisson.
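For instance, the Poisson distribution with mean λ is of the form (2.4) with θ = log λ:

$$f(x; \theta) = \frac{1}{x!}\,\exp\left[\theta x - e^{\theta}\right], \qquad x = 0, 1, 2, \ldots,$$

so that $h(x) = 1/x!$, $t(x) = x$ and $K(\theta) = e^{\theta}$; indeed $E_\theta[t(X)] = K'(\theta) = e^{\theta} = \lambda$ and $Var_\theta(t(X)) = K''(\theta) = \lambda$.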
Suppose that we have a random sample from the exponential family. It follows that
the joint density is given by
$$f(x_1, \ldots, x_n; \theta) = h(x_1, \ldots, x_n) \exp\left[\theta \sum_{i=1}^{n} t(x_i) - nK(\theta)\right],$$

which can again be identified as being a member of the exponential family. An important
result in the context of estimation is the following.
Theorem 2.6 (Cramér-Rao Inequality). Suppose that {X1 , . . . , Xn } is a random sample
from a distribution having density f (x; θ) . Under certain regularity conditions which
permit the exchange of the order of differentiation and integration, the variance of any
estimator T (X1 , . . . , Xn ) is bounded below by

$$Var_\theta\left(T(X_1, \ldots, X_n)\right) \geq \frac{\left[b'(\theta)\right]^2}{I(\theta)}, \tag{2.5}$$

where b (θ) = Eθ [T (X1 , . . . , Xn )] and


$$I(\theta) = -E_\theta\left[\frac{\partial^2 \log f(X_1, \ldots, X_n; \theta)}{\partial \theta^2}\right]. \tag{2.6}$$

The expression in (2.6) is known as the Fisher information and it plays a key role in
estimation and hypothesis testing. The regularity conditions are satisfied by members


of the exponential family. Under those conditions, it can be shown that


 
$$E_\theta\left[\frac{\partial \log f(X_1, \ldots, X_n; \theta)}{\partial \theta}\right] = 0$$
and
$$I(\theta) = E_\theta\left[\left(\frac{\partial \log f(X_1, \ldots, X_n; \theta)}{\partial \theta}\right)^2\right] = -E_\theta\left[\frac{\partial^2 \log f(X_1, \ldots, X_n; \theta)}{\partial \theta^2}\right] < \infty.$$

Example 2.6. Suppose we have a random sample from the normal distribution with
mean μ and variance σ 2 . Then, X̄n is a consistent and unbiased estimator of the popu-
lation mean μ whose variance is σ 2 /n. The Fisher information can be calculated to be
$n\sigma^{-2}$ and hence the Cramér-Rao lower bound for the variance of any unbiased estimator
is σ 2 /n. Consequently, the sample mean has the smallest variance among all unbiased
estimators.

A multi-parameter version of the Cramér-Rao inequality also exists. Suppose that θ is a p-dimensional parameter. The p × p Fisher information matrix I(θ) has the (i, j) entry
$$I_{ij}(\theta) = -E_\theta\left[\frac{\partial^2 \log f(X_1, \ldots, X_n; \theta)}{\partial \theta_i\, \partial \theta_j}\right],$$
provided the derivatives exist.
Let T (X1 , . . . , Xn ) be a vector-valued estimator of θ and let

b (θ) = Eθ [T (X1 , . . . , Xn )] .

Then under certain regularity conditions, the multi-parameter Cramér-Rao lower bound states that, in matrix notation,
$$Cov_\theta\left(T(X_1, \ldots, X_n)\right) \geq \left(\frac{\partial b(\theta)}{\partial \theta}\right) \left[I(\theta)\right]^{-1} \left(\frac{\partial b(\theta)}{\partial \theta}\right)'. \tag{2.7}$$

The matrix inequality above of the form A ≥ B is interpreted to mean that the difference
A−B is positive semi-definite. In general the regularity conditions require the existence
of the Fisher information and demand that either the density function has bounded
support and the bounds do not depend on θ or the density has infinite support, is
continuously differentiable and its support is independent of θ.

Example 2.7. Suppose that in Example 2.6 we would like to estimate θ = (μ, σ 2 )
where both the mean μ and the variance σ 2 are unknown. Let

$$T(X_1, \ldots, X_n) = \left(\bar{X}_n,\ S_n^2\right)$$


be the vector of the sample mean and sample variance respectively. We have
$$I(\theta)^{-1} = \begin{pmatrix} \dfrac{\sigma^2}{n} & 0 \\ 0 & \dfrac{2\sigma^4}{n} \end{pmatrix}.$$

It can be shown that $\bar{X}_n$ and $S_n^2$ are uncorrelated and
$$Var\left(\bar{X}_n\right) = \frac{\sigma^2}{n}, \qquad Var\left(S_n^2\right) = \frac{2\sigma^4}{n-1}, \qquad n > 1.$$

Consequently, we conclude that $\bar{X}_n$ attains the Cramér-Rao lower bound whereas $S_n^2$ does not.

By far the most popular method for finding estimators is the method of maximum
likelihood developed by R.A. Fisher.

Definition 2.8. An estimator of a parameter θ, denoted θ̂ n , is a maximum likelihood


estimator if it maximizes the likelihood function L (θ) in (2.3).

Provided the derivatives exist, the maximization of the likelihood may sometimes be
done by maximizing instead the logarithm of the likelihood since

$$\frac{\partial L(\theta)}{\partial \theta} = 0 \Longrightarrow \frac{1}{L(\theta)}\,\frac{\partial L(\theta)}{\partial \theta} = \frac{\partial \log L(\theta)}{\partial \theta} = 0.$$

Example 2.8. Let {X1 , . . . , Xn } be a random sample from a normal distribution with
mean μ and variance σ 2 . The log likelihood function is given by

$$\log L\left(\mu, \sigma^2\right) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log \sigma^2 - \frac{\sum (x_i - \mu)^2}{2\sigma^2}.$$
The maximum likelihood equations are then

$$\frac{\partial \log L(\mu, \sigma^2)}{\partial \mu} = \frac{\sum (x_i - \mu)}{\sigma^2} = 0,$$
$$\frac{\partial \log L(\mu, \sigma^2)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{\sum (x_i - \mu)^2}{2\sigma^4} = 0,$$
from which we obtain the maximum likelihood estimators $\bar{X}_n$ and $\frac{(n-1)S_n^2}{n}$ for the mean and variance respectively.
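When no closed form is available, the likelihood can be maximized numerically. The following Python sketch (our illustration, using scipy.optimize) recovers the closed-form normal maximum likelihood estimators:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
x = rng.normal(loc=3.0, scale=2.0, size=500)

# negative log likelihood of N(mu, sigma^2), with sigma^2 parametrized through
# its logarithm to keep it positive during the search
def nll(par):
    mu, log_s2 = par
    s2 = np.exp(log_s2)
    return 0.5 * len(x) * np.log(2 * np.pi * s2) + ((x - mu) ** 2).sum() / (2 * s2)

fit = minimize(nll, x0=[0.0, 0.0])
mu_hat, s2_hat = fit.x[0], np.exp(fit.x[1])

print(mu_hat, x.mean())   # numerical MLE versus X_bar
print(s2_hat, x.var())    # numerical MLE versus (n-1) S_n^2 / n (np.var divides by n)
```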

Example 2.9 (Multivariate Normal Distribution). Suppose that {Y 1 , . . . , Y n } is a ran-


dom sample of size n from the multivariate normal distribution Np (μ, Σ) . The maximum


likelihood estimators of μ and Σ are given respectively by

$$\bar{Y}_n = \frac{1}{n}\sum_{i=1}^{n} Y_i, \qquad \hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \bar{Y}_n\right)\left(Y_i - \bar{Y}_n\right)'.$$

Example 2.10. Suppose we have a random sample {X1 , . . . , Xn } from a uniform distri-
bution on the interval (0, θ). Since the range of the density depends on θ we cannot take
the derivative of the likelihood function. Instead we note that the likelihood function is
given by
$$L(\theta) = \begin{cases} \dfrac{1}{\theta^n} & \max_{1\leq i\leq n} X_i < \theta \\ 0 & \text{elsewhere,} \end{cases}$$
and hence, the maximum likelihood estimator of θ is $\max_{1\leq i\leq n} X_i$.

Example 2.11 (Multinomial Distribution). Consider a generalization of the binomial


distribution which allows for each trial one of M possible categorical outcomes. Suppose that at the kth trial we observe the jth outcome, k = 1, . . . , n, j = 1, . . . , M. Let $X_k = (0, \ldots, 0, 1, 0, \ldots, 0)'$ be the M-dimensional vector which records a 1 at the jth position if the jth categorical outcome is observed and 0 otherwise. Let the frequency vector $N = (N_1, \ldots, N_M)$ denote the number of occurrences of the categories in n repetitions of
the same experiment conducted under independent and identical conditions (i.e., i.i.d.).
It follows that
$$N = \sum_{k=1}^{n} X_k.$$

Denote the probability vector by p = (p1 , . . . , pM ) where pj is the probability of observing


the jth outcome. The probability distribution associated with N is given by

$$P(N_j = n_j,\ j = 1, \ldots, M) = \frac{n!}{n_1! \cdots n_M!}\, p_1^{n_1} \cdots p_M^{n_M}, \qquad \sum_{j=1}^{M} p_j = 1, \quad \sum_{j=1}^{M} n_j = n,$$

and it is called the multinomial distribution. The {X k } are i.i.d. with covariance matrix
having (i, j) entry σij given by

$$\sigma_{ij} = \begin{cases} p_i(1 - p_i) & i = j \\ -p_i p_j & i \neq j. \end{cases}$$


Also the covariance matrix of N is not of full rank and is given in matrix notation by
$$\Sigma = Cov(N) = n\left[\operatorname{diag}(p) - pp'\right].$$
It can be shown that for large n,
$$\frac{1}{\sqrt{n}}\,(N - np) \xrightarrow{L} N\left(0,\ \operatorname{diag}(p) - pp'\right).$$

2.3.2. Hypothesis Testing


Suppose that {X1 , . . . , Xn } is a random sample from a distribution where the density
of X is given by f (x; θ) , θ ∈ Rp . In hypothesis testing it is common to formulate
two hypotheses about θ: the null hypothesis denoted H0 represents the status quo (no
change). The alternative, denoted H1 , represents the hypothesis that we are hoping to
accept. To illustrate, suppose that we are interested in assessing whether or not a new
drug represents an effective treatment compared to a placebo. The null hypothesis states
that the new drug is not more effective than a placebo, whereas the alternative states
that it is. Since the assessment is made on the basis of evidence contained in a random
sample, there is the possibility of error. A Type I error results when one falsely rejects
the null hypothesis in favor of the alternative. A Type II error occurs when one falsely
accepts the null hypothesis when it is false. In our example, a Type I error means we
would incorrectly adopt an ineffective drug whereas a Type II error means we incorrectly
forgo an effective drug. Both types of error cannot simultaneously be controlled. It is
customary to prescribe a bound on the probability of committing the Type I error and
to then look for tests that minimize the Type II error. The Neyman-Pearson lemma makes these ideas concrete.
Formally we are interested in testing the null hypothesis

H0 : θ ∈ Θ 0

against the alternative hypothesis

H1 : θ ∈ Θ 1 ,

where Θ0 and Θ1 are subsets of the parameter space. In the situation where both Θ0 , Θ1
consist of single points θ0 and θ1 respectively, the Neyman-Pearson lemma provides
an optimal solution. Set x = (x1 , . . . , xn ) and let φ (x) be a critical function which
represents the probability of rejecting the null hypothesis when x is observed. Also let
α denote the prescribed probability of rejecting the null hypothesis when it is true (also
known as the size of the test).


Lemma 2.4 (Neyman-Pearson Lemma). Suppose that there exists some dominating
measure μ with respect to which we have densities f (x; θ0 ) and f (x; θ1 ) . Then the most
powerful test of H0 : θ = θ0 against H1 : θ = θ1 is given by
$$\phi(x) = \begin{cases} 1 & \text{if } \dfrac{\prod_{i=1}^{n} f(x_i; \theta_1)}{\prod_{i=1}^{n} f(x_i; \theta_0)} > k \\[1ex] \gamma & \text{if } \dfrac{\prod_{i=1}^{n} f(x_i; \theta_1)}{\prod_{i=1}^{n} f(x_i; \theta_0)} = k \\[1ex] 0 & \text{if } \dfrac{\prod_{i=1}^{n} f(x_i; \theta_1)}{\prod_{i=1}^{n} f(x_i; \theta_0)} < k, \end{cases}$$

where k is chosen such that $E_{\theta_0}[\phi(X)] = \alpha$. The power function of the test φ is defined to be
$$E_\theta[\phi(X)] = \int \phi(x) \prod_{i=1}^{n} f(x_i; \theta)\, dx_1 \cdots dx_n$$

and it represents the probability of rejecting the null hypothesis for a given θ.

Example 2.12. Given a random sample {X1 , . . . , Xn } from a normal distribution with
unknown mean μ and known variance σ 2 , suppose that we are interested in testing the
null hypothesis
H0 : μ = μ0
against the alternative hypothesis

H1 : μ = μ1 > μ0 .

Then it can be shown that the uniformly most powerful test is given by



⎨1 X̄n > k
φ (x) = γ X̄n = k


⎩0 X̄n < k

where k = μ0 + zα √σn , and zα is the upper α-point of the standard normal distribution.

Example 2.13. In the case of a random sample from an exponential distribution with
mean θ, describing the lifetimes of light bulbs, the uniformly most powerful test of

H 0 : θ = θ0

against the alternative hypothesis

H 1 : θ = θ1 < θ0 .

24
2. Fundamental Concepts in Parametric Inference

is given by ⎧


⎨1 X̄n > k
φ (x) = γ X̄n = k


⎩0 X̄n < k
where k is a solution of Γ(n, nk/θ0 ) = α(n − 1)!. Here, Γ(a, b) is an upper incomplete
∞
gamma function defined as b ua−1 e−u du.
Uniformly most powerful (UMP) tests rarely exist in practice. An exception occurs
when the family of densities possesses a monotone likelihood ratio.

Definition 2.9. We shall say that a family of densities {f (x; θ) , θ ∈ Θ} has monotone
likelihood ratio if the ratio ff (x;θ 2)
(x;θ1 )
is nondecreasing in some function T (x) for all θ1 < θ2
in some interval Θ.

The exponential family (2.4) has likelihood ratio

f (x; θ2 )
= exp {(θ2 − θ1 ) t (x) − (K (θ2 ) − K (θ1 ))}
f (x; θ1 )

which is nondecreasing in x provided t (x) is nondecreasing. In the situation where the


family of densities has monotone likelihood ratio, it can be shown that there exists uni-
formly most powerful tests (Ferguson (1967), p. 210). Specifically, a test φ0 is uniformly
most powerful of size α if it has size α and if the power function satisfies

Eθ [φ0 (X)] ≥ Eθ [φ (X)] , θ ∈ Θ1

As an application, we consider testing for the mean of a normal distribution the


hypothesis
H0 : μ = μ0
against the alternative hypothesis

H 1 : μ > μ0 .

The family has monotone likelihood ratio in x. The uniformly most powerful test of

H0 : μ = μ0

against the alternative hypothesis

H 1 : μ = μ1 > μ0

25
2. Fundamental Concepts in Parametric Inference

for μ1 fixed is given by the Neyman-Pearson lemma becomes





⎨1 X̄n > k
φ (x) = γ X̄n = k


⎩0 X̄ < k
n

This test has nondecreasing power function


 √ 
n (μ − μ0 )
Φ −zα + ≥ α, for μ > μ0 .
σ
and hence it is uniformly most powerful when the alternative is the composite μ > μ0 .
A generalization of the Neyman-Pearson lemma when both Θ0 and Θ1 are composite
sets is based on the likelihood ratio
#
supθ∈Θ0 ni=1 f (xi ; θ)
Λn = # with Θ = Θ0 ∪ Θ1 .
supθ∈Θ ni=1 f (xi ; θ)

The likelihood ratio test rejects the null hypothesis whenever Λn is small enough. The
factorization theorem shows that the likelihood ratio Λn is based on the sufficient statistic
and moreover, it is invariant under transformations of the parameter space that leave
the hypothesis and alternative hypotheses invariant. For a random sample of size n from
the exponential family (2.4) we have

log Λn = n sup inf [(θ0 − θ) t̄n − K (θ0 ) + K (θ)] ,


θ0 ∈Θ0 θ∈Θ


where t̄n = ni=1 t (xi ) /n.
In certain situations, as in the case of a Cauchy location family, a uniformly powerful
test may not exist. On the other hand, a locally most powerful test which maximizes the
power function at θ0 may exist. Provided we may differentiate under the integral sign
the power function with respect to θ we see upon using the generalized Neyman-Pearson
lemma (see Ferguson (1967), p. 204) that a test of the form
⎧ #n #n

⎪ 1 if ∂
f (x ; θ ) > k
⎨ ∂θ i=1 i 0 i=1 f (xi ; θ0 )

# n # n
φ(x) = γ if ∂θ i=1 f (xi ; θ0 ) = k i=1 f (xi ; θ0 )

⎪ # #
⎩0 if ∂ n f (x ; θ ) < k n f (x ; θ )
∂θ i=1 i 0 i=1 i 0

maximizes the power function at θ0 among all tests of size α.

26
2. Fundamental Concepts in Parametric Inference

Definition 2.10. Let {X1 , . . . , Xn } be a random sample from some distribution hav-
ing density f (x; θ) , θ ∈ Rp . Let L (θ; x) be the likelihood function where x =
(x1 , . . . , xn ) . Let
 (θ; x) = log L (θ; x)
The derivative
∂ (θ; x)
U (θ; x) =
∂θ
is called the score function.

The locally most powerful test can be seen to be equivalently based on the score
function since

∂ " n
U (θ; x) = log f (xi ; θ 0 )
∂θ i=1
#n

f (xi ; θ 0 )
= ∂θ#n i=1 .
i=1 f (xi ; θ 0 )

2.3.2.1. The Three “Amigos”


Lemma 2.5. Let U (θ; X) = ∂θ ∂
 (θ; X) where θ is possibly a vector. Denote by f (x; θ)
the joint density of X1 , . . . , Xn . Then, if we can differentiate under the integral signs
below, the following properties hold:

(i) Eθ [U (θ; X)] = 0 and

(ii) Covθ [U (θ; X)] = I (θ)

Proof. (i) Note that



∂ (θ; x)
Eθ [U (θ; X)] = f (x; θ) dx
∂θ

∂f (x; θ)
= dx
∂θ


= f (x; θ) dx = 0
∂θ
 
(ii) The (i, j) term of the matrix Covθ [U (θ; X)] is given by Eθ ∂ (θ;X) ∂ (θ;X)
∂θi ∂θj
. Since

∂ ∂ (θ; x)
0 = f (x; θ) dx
∂θi ∂θj
 2 
∂  (θ; x) ∂ (θ; x) ∂
= f (x; θ) dx + f (x; θ) dx
∂θi ∂θj ∂θj ∂θi

27
2. Fundamental Concepts in Parametric Inference
 
∂ 2  (θ; x) ∂ (θ; x) ∂ (θ; x)
= f (x; θ) dx + f (x; θ) dx
∂θi ∂θj ∂θj ∂θj

the result follows.

For any hypothesis testing problem, there are three distinct possible test statistics: the
likelihood ratio test, the Wald test, and the Rao score test, all of which are asymptotically
equivalent as the sample size gets large. In the lemma below, we outline the proof for
the asymptotic distribution of the Rao score test.

Lemma 2.6 (Score Test). Let {X1 , . . . , Xn } be a random sample from a continuous
distribution having density f (x; θ) , θ ∈ Rp and suppose we wish to test

H0 : θ ∈ Θ 0

against the alternative


H1 : θ ∈ Θ 1 .
Let θ̂ 0 be the maximum likelihood estimate of θ under H0 . The test statistic under H0
  
U  θ̂ 0 ; X I −1 θ̂ 0 U θ̂ 0 ; X

has, as the sample size gets large, under the null hypothesis a χ2k distribution where k
is the number of constraints imposed by the null hypothesis.

Proof. The result follows from the multivariate central limit theorem since the score is
a sum of independent identically distributed random vectors

∂ "n
U (θ; X) = log f (Xi ; θ)
∂θ i=1
n  

= log f (Xi ; θ)
i=1
∂θ

with mean 0 and covariance matrix I (θ) .

Theorem 2.7 (The Three Amigos). Let X = {X1 , . . . , Xn } be a random sample from a
continuous distribution having density f (x; θ) , θ ∈ Θ, the parameter space. Suppose we
are interested in testing the general null hypothesis H0 : θ ∈ Θ0 against the alternative
H1 : θ ∈ Θ1 = Θ − Θ0 . Then, the likelihood test, the Wald test, and the score test all
reject the null hypothesis for large values and are asymptotically distributed as central
chi-square distributions with k degrees of freedom as n → ∞ where k is the number of
constraints imposed by the null hypothesis. Specifically,

28
2. Fundamental Concepts in Parametric Inference

(a) The likelihood ratio test rejects whenever,


⎡ ⎤
L θ̂ 0 ; X
−2 log ⎣  ⎦ > χ2k (α) ;
L θ̂; X

(b) The Wald test rejects whenever


  
W = θ̂ − θ 0 I θ̂ θ̂ − θ 0 > χ2k (α) ;

(c) The score test rejects whenever


  
U  θ̂ 0 ; X I −1 θ̂ 0 U θ̂ 0 ; X > χ2k (α) ,

where θ̂ 0 and θ̂ are the maximum likelihood estimates of θ under H0 and H0 ∪ H1


respectively.

Proof. By expanding the likelihood L θ̂; X around θ ∈ Θ0 in a second order Taylor
series, it can be shown that these three tests are equivalent (see van der Vaart (2007),
p. 231 or Cox and Hinkley (1974), p. 315 & p. 324).

We note that it is possible to substitute the matrix of second partials


 of the log
likelihood evaluate at θ̂ for the theoretical Fisher information I θ̂ . The score test
is used since it may be easier to maximize the log likelihood subject to constraints as
we shall see later on. The likelihood ratio test requires computation of both θ̂ 0 and θ̂.
The Wald test requires computation of θ̂ whereas the score test requires computation of
θ̂ 0 . All of these tests can be inverted to provide confidence regions for θ. In that case,
the Wald test will lead to ellipsoidal regions. Finally, we see that the Wald test can be
used to test a single hypothesis on multiple parameters as for example H0 : θ = θ 0 or
H0 : Aθ = b.
We illustrate the three tests in the following examples.

Example 2.14. Let {X1 , . . . , Xn } be a random sample from the Poisson distribution
with mean θ > 0 given by

e−θ θx
f (x; θ) = , x = 0, 1, . . . , θ > 0.
x!
Suppose we wish to test
H 0 : θ = θ0

29
2. Fundamental Concepts in Parametric Inference

against the alternative


H1 : θ = θ0 .
Then the score function evaluated at θ0 is
n
U θ0 ; X̄ = X̄ − θ0
θ0

whereas the Fisher information is given by I (θ) = nθ . Hence the score test rejects
whenever  
θ0 n 2 n 2
X̄ − θ0 = X̄ − θ0 > χ21 (α)
n θ0 θ0
The Wald test rejects whenever
n 2
X̄ − θ0 > χ21 (α) .

The likelihood ratio test rejects the null hypothesis whenever

L (θ0 ; X)
− 2 log  > χ21 (α) (2.8)
L θ̂; X

It can be seen that


 
L (θ0 ; x) θ0
log  = nx̄ log − n (θ0 − x̄) (2.9)
L θ̂; x x̄

Example 2.15. Let {X1 , . . . , Xn } be a random sample from the normal distribution
with mean μ and variance σ 2 . Suppose we wish to test

H0 : μ = μ0

against the alternative


H1 : μ = μ0 .
Then the score function evaluated at μ0 is
n
U μ0 ; X̄ = 2 X̄ − μ0
σ
whereas the Fisher information is given by I (θ) = σn2 . Hence the score test rejects
whenever
σ2  n 2 n 2
2
X̄ − μ 0 = 2
X̄ − μ 0 > χ21 (α) .
n σ σ
n−1 2
Since σ 2 is unknown, we may replace it by its maximum likelihood estimator n
Sn and

30
2. Fundamental Concepts in Parametric Inference

hence the score test rejects whenever


 2
n2 X̄ − μ0
> χ21 (α) .
n−1 Sn

Notes

1. The score test geometrically represents the slope of the log-likelihood function
evaluated at θ̂ 0 . It is locally most powerful for small deviations from θ 0 . This can
be seen from a Taylor series expansion of the likelihood function L (θ 0 + h) around
θ 0 for small values of the vector h.

2. The Wald test is not invariant under re-parametrization unlike the score test which
is (Stein, 1956). It is easy to construct confidence intervals using the Wald statistic.
However, since the standard error is calculated under the alternative, it may lead
to poor confidence intervals.

3. Both the likelihood ratio test and the Wald test are asymptotically equivalent
to the score test under the null as well as under Pitman alternative hypotheses
(Serfling (2009), p. 155). All the statistics have limiting distributions that are
weighted chi-squared if the model is misspecified. See (Lindsay and Qu, 2003) for
a more extensive discussion of the score test and related statistics.

4. An interesting application of the score test is given by (Jarque and Bera (1987))
who consider the Pearson family of distributions.

In the next section we study the concept of contiguity and its consequences.

2.3.2.2. Contiguity
In the previous section, we were concerned with determining the asymptotic distribution
of the test statistics under the null hypothesis. It is of interest for situations related to the
efficiency of tests, to determine the asymptotic distribution of the test statistic under
the alternative hypothesis as well. Since most of the tests considered in practice are
consistent, they tend to perform well for alternatives “far away” from the null hypothesis.
Consequently, the interest tends to focus on alternatives which are “close” to the null
hypothesis in a manner to be made more precisely here. The concept of contiguity
was introduced by Le Cam in connection with the study of local asymptotic normality.
Specifically, the concept enables one to obtain the limiting distribution of a sequence of
statistics under the alternative distribution from knowledge of the limiting distribution
under the null hypothesis distribution. We follow closely the development and notation
in (Hájek and Sidak (1967); van der Vaart (2007)) and we begin with some definitions.

31
2. Fundamental Concepts in Parametric Inference

Definition 2.11. Suppose that P and Q are two probability measures defined on the
same measurable space (Ω, A) . We shall say that Q is absolutely continuous with respect
to P , denoted Q P if Q (A) = 0 whenever P (A) = 0 for all measurable sets A ∈ A.
If Q P , then we may compute the expectation of a function f (X) of a random
 
vector X under Q from knowledge of the expectation under P of the product f (X) dQ
dP
through the change of measure formula
 
dQ
EQ [f (X)] = EP f (X)
dP
 
The expression dQdP
is known as the Radon-Nikodym derivative. The asymptotic version
of absolute continuity is the concept of contiguity.
Definition 2.12. Suppose that Pn and Qn are two sequences of probability measures
defined by the same measurable space (Ωn , An ) . We shall say that Qn is contiguous with
respect to Pn , denoted Qn  Pn if Qn (An ) → 0 whenever Pn (An ) → 0 for all measurable
sets An ∈ An . The sequences Pn , Qn are mutually contiguous if both Qn  Pn and
Pn  Qn .
Example 2.16. Suppose that under Pn we have a standard normal distribution whereas
under Qn we have a normal distribution with mean μn → μ, and variance σ 2 > 0. Then
it follows that Pn and Qn are mutually contiguous. Note however that if μn → ∞, then
for An = [μn , μn + 1] , we have that Pn (An ) → 0, yet Qn (An ) →constant.
Definition 2.13. Le Cam proposed three lemmas which enable us to verify contiguity
and to obtain the limiting distribution under one measure given the limiting distribution
under another. Suppose that Pn and Qn admit densities pn , qn respectively and define
the likelihood ratio Ln for typical points x in the sample space



qn (x)
⎨ pn (x) pn (x) > 0
Ln (x) = 1 pn (x) = qn (x) = 0


⎩0 pn (x) = 0 < qn (x)

Lemma 2.7 (Le Cam’s First Lemma). Let Fn be the cdf of Ln which converges weakly
under Pn at continuity points to a distribution F for which
 ∞
xdF (x) = 1.
0

Then the densities qn are contiguous to the densities pn , n ≥ 1.


2
As a corollary, if log Ln is under Pn asymptotically normal with mean − σ2 and
variance σ 2 , then the densities qn are mutually contiguous to the densities pn . This

32
2. Fundamental Concepts in Parametric Inference

follows from the fact that the moment generating function of a normal (μ, σ 2 ) evaluated
at t = 1 must equal 1; that is for large n,
  σ2
E elog Ln → eμ+ 2
σ2 2
+ σ2
= e− 2 = 1.

Le Cam’s second lemma provides conditions under which a log likelihood ratio is
asymptotically normal under the hypothesis probability measure.
Let x = (x1 , . . . , xn ) and

"
n
pn (x) = fni (xi )
i=1

and
"
n
qn (x) = gni (xi )
i=1

Then

n
log Ln (X) = log [gni (Xi ) /fni (Xi )]
i=1

Let n (
 1
)
Wn = 2 [gni (Xi ) /fni (Xi )] 2 − 1
i=1

Lemma 2.8 (Le Cam’s Second Lemma). Suppose that the following uniform integrability
condition holds

lim max Pn (|gni (Xi ) /fni (Xi ) − 1| > ε) = 0, for all ε > 0
n→∞ 1≤i≤n

2

and Wn is asymptotically N − σ4 , σ 2 under Pn . Then
  
 σ 2
lim Pn log Ln (X) − Wn +  > ε = 0, ε > 0
n→∞ 4
2 
and under Pn , log Ln (X) is asymptotically N − σ2 , σ 2 .

The third lemma which is most often utilized establishes the asymptotic distribution
under the alternative hypothesis provided the measures are contiguous.

Lemma 2.9 (Le Cam’s Third Lemma). Let Pn , Qn be sequences of probability measures
on measurable spaces (Ωn , An ) such that Qn  Pn . Let Yn be a sequence of k-dimensional

33
2. Fundamental Concepts in Parametric Inference

random vectors such that under Pn


   
L μ Σ τ
(Yn , log Ln (X)) −
→ Nk+1 2 , .
− σ2 τ  σ2

Then, under Qn
L
Yn −
→ Nk (μ + τ, Σ)

We may now consider an important application of Le Cam’s lemmas. Suppose that


we have two sequences of probability measures defined by densities pθ , pθ+h/√n .
Let θ (x) = log pθ (x). Then under suitable regularity conditions, a second order
Taylor series expansion yields under pθ


" pθ+h/√n (Xi )  1  ˙


n
h I θ h
log =√ h θ (Xi ) − + o (1) (2.10)
i
p θ (X i ) n i=1
2

where ˙θ (Xi ) is the k-dimensional vector of partial derivatives. The expansion in (2.10) is
known as the local asymptotic normality property (LAN). It follows from the central limit 
h I θ h 
theorem that the right-hand side in (2.10) is asymptotically normal N − 2 , h Iθ h
and consequently, by Le Cam’s first lemma, the measures pθ , pθ+h/√n on the left-hand
side are contiguous. Next, suppose that we have a sequence of test statistics Tn which
are to first order, sums of i.i.d. random variables:

1 
n

n (Tn − μ) = √ ψ (Xi ) + opθ (1)
n i=1

 ˙

for some function ψ. Consequently, the joint distribution of √1 ψ (Xi ) , θ (Xi ) is
i n
under pθ asymptotically multivariate normal
 
    
√ " pθ+h/√n (Xi )  L 0 Σ τ
n (Tn − μ) , log −
→ Nk+1 h  Iθ h ,
i
p θ (X i ) − 2
τ  h  Iθ h

It then follows from Le Cam’s third lemma that n (Tn − μ) is asymptotically normal
under pθ+h/√n .
We may demonstrate contiguity in the case of an exponential family of distributions.

Example 2.17. Let X = (X1 , . . . , Xn ) be a random sample from the exponential family
for a real valued parameter θ

f (x; θ) = exp [θt (x) − K (θ)] g (x)

34
2. Fundamental Concepts in Parametric Inference

The log likelihood and its derivatives are respectively


n


 (θ; X) = θ t (Xi ) − nK (θ)
i=1


n
 (θ; X) = t (Xi ) − nK  (θ)
i=1
 (θ; X) = −nK  (θ)

It follows that
n 

" pθ+h/√n (Xi )  h 


n  
h
 
log = √ t (Xi ) − n K θ + √ − K (θ)
i=1
p θ (X i ) n i=1
n
h 
n
h2
= √ [t (Xi ) − K  (θ)] − K  (θ) + o (1)
n i=1 2

and consequently, the measures pθ = f (x; θ) , pθ+h/√n = f x; θ + √h
n
are contiguous.

The concept of contiguity will enable us to obtain the non-null asymptotic distribu-
tion of various linear rank statistics to be defined in later chapters.

2.3.3. Composite Likelihood


Composite likelihood is an approach whereby one can combine together by multiplying
several component likelihoods each of which is either a conditional or a marginal density.
The components are not necessarily independent and consequently, the resulting product
itself may or may not be a likelihood function in its own right. The advantage of using
composite likelihood methods is that it facilitates the modeling of dependence in smaller
dimensions. We review below some of the basic properties. For a more detailed review,
we refer the reader to Varin et al. (2011).
Let X be an m-dimensional random vector with density fX (x; θ) where θ is a k-
dimensional vector parameter taking values in Θ. Let {A1 , . . . , AK } be a set of marginal
or conditional events (subsets of the sample space) such that the corresponding likelihood
function is given by
Lk (θ; x) = f (x ∈ Ak ; θ) .
A composite likelihood is then defined to be the weighed product

"
K
LC = (Lk (θ; x))wk
k=1

35
2. Fundamental Concepts in Parametric Inference

where the {wk } are nonnegative weights selected to improve efficiency. The composite
log-likelihood is

K
cl (θ; x) = wk k (θ; x)
k=1

with
k (θ; x) = log Lk (θ; x) .
In the simplest case of independence, we have a genuine likelihood function

"
m
Lind (θ; x) = fX (xr ; θ) .
r=1

This composite likelihood leads to inference on the parameters θ as detailed in the


previous sections. In cases where our interest is in parameters involving dependence, we
shall consider likelihoods which for a fixed x are based on the pairwise differences

"
m "
m
fX (xr − xs ; θ) .
r=1 s
=r

Similarly, we may be interested in likelihoods of the form

"
m−1 "
m
Ldif f (θ; x) = fX (xr − xs ; θ) .
r=1 s=r+1

Some asymptotic theory is available when we have a random sample. Let θ̂ CL be


the maximum with respect to θ of the composite log likelihood cl (θ; x). Define the
following quantities.


U (θ; x) = cl (θ; x) (2.11)
∂θ  

H (θ) = Eθ − U (θ; X) (2.12)
∂θ
J (θ) = varθ {U (θ; X)} (2.13)
−1
G (θ) = H (θ) J (θ) H (θ) (2.14)
 

I (θ) = varθ log f (X; θ) (2.15)
∂θ

where G (θ) is the Godambe information matrix in a single observation. If cl (θ; x) is a


true log-likelihood, then G (θ) = I (θ). Then under some regularity conditions on the

36
2. Fundamental Concepts in Parametric Inference

log densities, we have that as n → ∞,


√ 
L
n θ̂ CL − θ −→ Np 0, G−1 (θ) ,

where Np (μ, G−1 (θ)) is a p-dimensional normal distribution with vector mean μ and
variance-covariance matrix G−1 (θ).

2.3.4. Bayesian Methods


In traditional parametric statistical inference, the probability joint density function of a
random sample X = (X1 , . . . , Xn ) , say f (x|θ) is a function of a parameter θ assumed
to be an unknown but fixed constant. A radical departure from this paradigm consists
of treating θ as a random variable itself with its own prior distribution function given
by p (θ; α) where α is a prespecified hyper-parameter. The marginal density of X is
calculated to be 
f (x|θ) p (θ; α) dθ

and the conditional density of θ given X = x, labeled the posterior density of θ is given
by Bayes’ theorem:
f (x|θ) p (θ; α)
p (θ|x) = 
f (x|θ) p (θ; α) dθ
In view of the factorization theorem in Section 2.3.1, the posterior density is always a
function of the sufficient statistic. The use of Bayesian methods enables us to update
information about the prior. There have been countless applications of Bayesian infer-
ence in practice. We refer the reader to Box and Tiao (1973) for further details. We
provide below some simple examples, whereas in Part III of this book, we consider a
modern application of Bayesian methods to ranking data.

Example 2.18. Let x = {x1 , . . . , xn } be a set of observed data randomly drawn from
a normal distribution with mean θ and variance σ 2 . Assume that θ is itself a random
variable having a normal prior distribution with unknown mean μ and known variance
τ 2 . Then the posterior distribution of θ given x is proportional to

"
n
p (θ|x) ≈ f (xi |θ) p (θ; α)
i=1
   −2 
nσ −2 2 τ 2
≈ exp − (x̄n − θ) exp − (θ − μ)
2 2
 −2 
τn 2
≈ exp − (θ − μn )
2

37
2. Fundamental Concepts in Parametric Inference

where
nσ −2 x̄n + τ −2 μ
μn =
nσ −2 + τ −2
τn−2 = nσ −2 + τ −2

We recognize therefore this posterior to be a normal density with mean μn and vari-
ance τn2 .

We provide another example below.

Example 2.19. Let {X1 , . . . , Xn } be a random sample from a Bernoulli distribution


with parameter 0 < θ < 1:

f (x|θ) = θx (1 − θ)1−x , x = 0, 1.

Suppose that the prior for θ is the Beta distribution with parameters α > 0, β > 0

Γ (α + β) α−1
p (θ; α, β) = θ (1 − θ)β−1 , 0 < θ < 1.
Γ (α) Γ (β)

Then the posterior distribution of θ given x is again a Beta distribution but with pa-
rameters  
αn = α + xi , βn = β + n − xi .

In any given problem involving Bayesian methods, the specification of the prior
and the consequent computation of the posterior may pose some difficulties. In certain
problems, this difficulty is overcome if the prior and the posterior come from the same
family of distributions. The prior in that case is called a conjugate prior . In the previous
example, both the prior and the posterior distributions were normal distributions but
with different parameters. In many modern problems not involving conjugate priors,
the Bayesian computation of the posterior distribution is a challenging task. The goal
in many instances is to compute the expectation of some function h (θ) with respect to
the posterior distributions:
 
h (θ) f (x|θ) p (θ; α) dθ
E[h(θ)|x] = h(θ)p(θ|x)dθ = 
f (x|θ) p (θ; α) dθ

We encounter such integrals for example when one is interested in the posterior proba-
bility that h(θ) lies in an interval. If one can draw a random sample θ(1) , θ(2) , . . . , θ(m)
from p(θ|x), the strong law of large numbers guarantees that E[h(θ)|x] is well approxi-

mated by the sample mean of h(θ): m1 m (i)
i=1 h(θ ) if m is large enough. What happens
if p(θ|x) is hard to sample? One way of proceeding is to approximate the posterior dis-
tribution by a multivariate normal density centered at the mode of θ obtained through

38
2. Fundamental Concepts in Parametric Inference

an optimization method (see Albert (2008) p. 94). Another approach is to generate sam-
ples from the posterior distribution of θ indirectly via various simulation methods such
as importance sampling, rejection sampling, and Markov Chain Monte Carlo (MCMC)
methods. We describe these methods below.

1. Importance Sampling.
When it is hard to sample from p(θ|x) directly, one can resort to importance
sampling. Suppose we are able to sample from another distribution q(θ). Then
 
p(θ|x)
Ep [h(θ)] = h(θ)p(θ|x)dθ = h(θ) q(θ)dθ = Eq [h(θ)w(θ)],
q(θ)

with a weight w(θ) = p(θ|x)/q(θ). One can draw a random sample θ(1) , θ(2) , . . . , θ(m)
from q(θ) and E[h(θ)|x] can be approximated by h(θ)* = 1 m h(θ(i) )w(θ(i) ). In
m i=1
practice, q(θ) should be chosen so that it is easy to be sampled and can achieve
a small estimation error. For Monte Carlo error, one can choose q(θ) to minimize
* see Robert and Casella (2004).
the variance of h(θ),

2. Rejection Sampling.
Rejection sampling consists of identifying a proposal density say q (θ) which “re-
sembles” the posterior density with respect to location and spread and which is
easy to sample from. As well, we ask that for all θ and a constant c,

p (θ|x) ≤ cq (θ)

Rejection sampling then proceeds by repeatedly generating independently a ran-


dom variable from both q and a uniformly distributed random variable U on the
interval (0, 1) . We accept θ as coming from the posterior if and only if

p (θ|x)
U≤ .
cq (θ)

We may justify the procedure as follows. Suppose in general that Y has density
fY (y) and V has density fV (v) with common support with

fY (y)
M = sup
y fV (y)

To generate a value from fY , we consider the following rejection algorithm:


(a) Generate instead a uniform variable U and the variable V from fV indepen-
dently.
1 fY (V )
(b) If U < M fV (V )
, set Y = V. Otherwise, return to (a).

39
2. Fundamental Concepts in Parametric Inference

It is easy to see that the value of Y generated from this rejection algorithm is
distributed as fY :
 
1 fY (V )
P (Y ≤ y) = P V ≤ y|U <
M fV (V )

P V ≤ y, U < M1 ffVY (V )
(V )
= 
P U < M1 ffVY (V )
(V )

 y  M1 ffVY (v)
(v)
−∞ 0 fV (v) dudv
=
∞  1 fY (v)
M fV (v)
−∞ 0 fV (v) dudv
 y
= fY (v) dv.
−∞

3. Markov Chain Monte Carlo Methods (MCMC).

- The Gibbs Sampling algorithm was developed by Geman and Geman (1984). See
Casella and George (1992) for an intuitive exposition of the algorithm. This algo-
rithm helps generate random variables indirectly from the posterior distribution
without relying on its density. The following describes the algorithm.
Suppose we have to estimate m parameters θ1 , θ2 , . . . , θm−1 and θm . Let p(θi |
(t)
x, θ1 , θ2 , . . . , θi−1 , θi+1 , . . . , θm ) be the full conditional distribution of θi and θi be
the random variate of θi simulated in the tth iteration. Then the procedure of
Gibbs Sampling is
(0) (0) (0)
(a) Set initial values for the m parameters {θ1 , θ2 , . . . , θm }.
(b) Repeat the steps below. At the tth iteration,
(t) (t) (t−1) (t−1) (t−1) (t−1)
(1) draw θ1 from p(θ1 | x, θ2 , θ3 , θ4 , . . . , θm )
(t) (t) (t) (t−1) (t−1) (t−1)
(2) draw θ2 from p(θ2 | x, θ1 , θ3 , θ4 , . . . , θm )
(t) (t) (t) (t) (t−1) (t−1)
(3) draw θ3 from p(θ3 | x, θ1 , θ2 , θ4 , . . . , θm )
..
.
(t) (t) (t) (t) (t) (t)
(m) draw θm from p(θm | x, θ1 , θ2 , θ3 , . . . , θm−1 )
Tierney (1994) showed that if this process is repeated many times, the random
variates drawn will converge to a single variable drawn from the joint posterior
distribution of θ1 , θ2 , . . . , θm−1 and θm given x. Suppose that the process is re-
peated M + N times. In practice, one can discard the first M iterates as the
burn-in period. In order to eliminate the autocorrelations of the iterates, one in

40
2. Fundamental Concepts in Parametric Inference

every a observations is kept in the last N iterates. The choice of M can be deter-
mined by examining the plot of traces of the Gibbs iterates for stationary of the
Gibbs iterates. The value of a could be selected by examining the autocorrelation
plots of the Gibbs iterates.

- The Metropolis-Hastings (M-H) algorithm was developed by Metropolis et al.


(1953) and subsequently generalized by Hastings (1970). The algorithm gives
rise to the Gibbs sampler as a special case (Gelman, 1992). This algorithm is
similar to that of the acceptance-rejection method which either accepts or rejects
the candidate except that the value of the current state is retained when rejection
occurred in the M-H algorithm. For more details, see Liang et al. (2010).
When a random variable θ follows a nonstandard posterior distribution p(θ|x),
the M-H algorithm helps to simulate from the posterior distribution indirectly. In
this algorithm, we need to specify a proposed density q(θ|θ(t−1) ) from which we
can generate the next θ given θ(t−1) . The procedure of the Metropolis-Hastings
algorithm is as follows.
(1) Set a starting value θ(0) and t = 1.
(2) Draw a candidate random variate θ∗ from q(θ|θ(t−1) ).
(3) Accept θ(t) = θ∗ with the acceptable probability min(1, R), where

p (θ∗ |x) q θ(t−1) |θ∗
R= .
p (θ(t) |x) q (θ∗ |θ(t−1) )

Otherwise, set θ(t) = θ(t−1) . Update t ← t + 1.


(4) If steps (2) and (3) are repeated for a large number of times, the random
variates come from the posterior distribution p(θ|x). In practice, we repeat
steps (2) and (3) for M + N times, discard the first M iterates and then select
one in every ath iterate to reduce the autocorrelations of the random variates.

2.3.5. Variational Inference


Variational inference can provide a deterministic approximation to an intractable poste-
rior distribution. Suppose the posterior distribution p(θ|x) to be sampled is intractable.
The basic idea is to pick an approximation q(θ) chosen from some tractable family, such
as a Gaussian distribution, and then to try to make q as close to p (θ|x) as possible.
The variational Bayesian (VB) method defines the “closeness” in terms of the Kullback-
Liebler (KL) divergence:

q(θ)
KL(q(θ)|p(θ|x)) = q(θ) log dθ.
p(θ|x)

41
2. Fundamental Concepts in Parametric Inference

However, KL may be hard to compute as evaluating p(θ|x) requires evaluating the


intractable normalizing constant p(x). It is easy to show that

p(x, θ)
KL(q(θ)|p(θ|x)) = −L(q) + log p(x), L(q) = q(θ) log dθ.
q(θ)
Since the KL divergence is always nonnegative, L(q) is the lower bound for the model
log-likelihood log p(x). So minimizing the KL divergence is equivalent to maximizing
L(q) as log p(x) is a constant.
One of the most popular forms of VB is the so-called mean-field approximation which
factorizes q into independent factors of the form:
"
q(θ) = qi (θi ).
i

To retain the possible dependency among the θi ’s, one may adopt a structural factoriza-
tion of q(θ) as
q(θ) = q1 (θ1 )q2 (θ2 |θ1 )q3 (θ3 |θ2 , θ1 ).
Under either the mean-field approximation or the structural factorization, we can convert
the problem of minimizing the KL divergence with respect to the joint distribution q(θ)
to that of minimizing KL divergences with respect to individual univariate distribution
qi ’s. For an application to an angle-based ranking model, see Chapter 11.

2.4. Exercises
Exercise 2.1. Suppose that the conditional density of Y given μ is normal with mean μ
and variance σ 2 . Suppose that the distribution of μ is normal with mean m and variance
Y −E[Y ]
τ 2 . Show that √ is also normally distributed.
V ar(Y )

Exercise 2.2. Let {X1 , X2 , . . . , Xn } be a random sample of size n from a cumulative


distribution F (x) with density f (x) .
(a) Find the joint density of X(i) , X(j)

(b) Find the distribution of the range X(n) − X(1) when F is given by the uniform on
the interval (0, 1) .

(c) Find the distribution of the sample median when n is an odd integer.
Exercise 2.3. Suppose that in Exercise 2.2, F (x) is the exponential cumulative distri-
bution with mean 1.
(a) Show that the pairwise differences X(i) − X(i−1) for i = 2, . . . , n are independent.

(b) Find the distribution of the X(i) − X(i−1) .

42
2. Fundamental Concepts in Parametric Inference

Exercise 2.4. Consider the usual regression model

Yi = xi β + ei , i = 1, . . . , n,

where β and xi are p × 1 vectors and {ei } are i.i.d. N (0, σ 2 ) random error terms.
Suppose we wish to test
H0 : Aβ = 0
against
H1 : Aβ = 0,
where A is a k × p known matrix.

(a) Obtain the likelihood ratio test.

(b) Obtain the Wald test.

(c) Obtain the Rao score test.



Exercise 2.5. By using a Taylor series expansion of log (1 + δ) in (2.9) for δ = θx̄0 − 1 ,
show that the likelihood ratio test in Example 2.13 is asymptotically equivalent to the
Wald test.

Exercise 2.6. Suppose that we have a random sample of size n from a Bernoulli distri-
bution with mean θ. Suppose that we impose a Beta(α, β) prior distribution on θ. Find
the posterior distribution of θ.

Exercise 2.7. Suppose that we have a random sample of size n from some distribution
having density f (x|θ) conditional on θ. Suppose that there is a prior density g (θ) on θ.
We would like to estimate θ using square error loss, L (θ, a) = (θ − a)2 .

(a) Show that the mean of the posterior distribution minimizes the Bayes risk
EL (θ, a) ;

(b) Find the mean of the posterior distribution when

f (x|θ) = θe−θx , x > 0, θ > 0


and
β α α−1 −βθ
g (θ) = x e , θ > 0, α > 0, β > 0
Γ (α)
is the Gamma density.

43
2. Fundamental Concepts in Parametric Inference

Exercise 2.8. Suppose that X has a uniform density on the interval θ − 12 , θ + 12 .
X +X
(a) Show that for a random sample of size n, the estimator (n) 2 (1) has mean square
1
error 2(n+2)(n+1) where X(n) , X(1) are respectively the maximum and minimum of
the sample.

(b) Show that the estimator X(n) − 12 is consistent and has mean square error 2
(n+2)(n+1)
.

Exercise 2.9. Suppose that under Pn we have a uniform distribution on the interval [0, 1]
whereas under Qn the distribution is uniform on the interval [an , bn ] with an → 0, bn → 1.
Show that Pn , Qn are mutually contiguous.

44
3. Tools for Nonparametric
Statistics
Nonparametric statistics is concerned with the development of distribution free methods
to solve various statistical problems. Examples include tests for a monotonic trend, or
tests of hypotheses that two samples come from the same distribution. One of the
important tools in nonparametric statistics is the use of ranking data. When the data
is transformed into ranks, one gains the simplicity that objects may be more easily
compared. For example, if web pages are ranked in accordance to some criterion, one
obtains a quick summary of choices. In this chapter, we will study linear rank statistics
which are functions of the ranks. These consist of two components: regression coefficients
and a score function. The latter allows one to choose the function of the ranks in order
to obtain an “optimal score” while the former is tailor made to the problem at hand.
Two central limit theorems, one due to Hajek and Sidak and the other to Hoeffding, play
important roles in deriving asymptotic results for statistics which are linear functions of
the ranks.
The subject of U statistics has played an important role in studying global properties
of the underlying distributions. Many well-known test statistics including the mean and
variance of a random sample are U statistics. Here as well, Hoeffding’s central limit
theorem for U statistics plays a dominant role in deriving asymptotic distributions. In
the next sections we describe in more detail some of these key results.

3.1. Linear Rank Statistics


The subject of linear rank statistics was developed by (Hájek and Sidak (1967)). In
this section we mention briefly some of the important properties. Let {X1 , . . . , Xn }
constitute a random sample from some continuous distribution. The rank of Xi denoted
Ri is defined to be the ith largest value among the {Xj } . A functional definition can be
given as
n
Ri = 1 + I [Xi − Xj ] ,
j=1

© Springer Nature Switzerland AG 2018 45


M. Alvo, P. L. H. Yu, A Parametric Approach to Nonparametric Statistics,
Springer Series in the Data Sciences, https://doi.org/10.1007/978-3-319-94153-0_3
3. Tools for Nonparametric Statistics

where 
1 u > 0,
I [u] =
0 u < 0.
We begin with the definition of a linear rank statistic.

Definition 3.1. Let X1 , . . . , Xn constitute a random sample from some continuous


distribution and let R1 , . . . , Rn represent the corresponding ranks. A statistic of the
form
n
Sn = a (i, Ri )
i=1

is called a linear rank statistic where the {a (i, j)} represents a matrix of values. The
statistic is called a simple linear rank statistic if

a (i, Ri ) = ci a (Ri )

where {c1 , . . . , cn } and {a (1) , . . . , a (n)} are given vectors of real numbers. The {ci } are
called regression coefficients whereas the {ai } are labeled scores.

To illustrate why the {ci } are called regression coefficients, suppose that X1 , . . . , Xn
are a set of independent random variables such that

Xi = α + Δci + εi ,

where {εi } are i.i.d. continuous random variables with cdf F having median 0. We are
interested in testing hypotheses about the slope Δ. Denoting the ranks of the differences
{Xi − Δci } by R1 , . . . , Rn and assuming that c1 ≤ c2 ≤ . . . ≤ cn we see that the usual
sample correlation coefficient for the pairs (ci , a (Ri )) is

(ci − c̄) (a (Ri ) − ā)
  i ,
2 2
i (ci − c̄) i (a (Ri ) − ā)

where a (.) is an increasing function and c̄ and ā are respectively the means of the
{ci }and{a (Ri )}. Some simplification shows this is a simple linear rank statistic.
Many test statistics can be expressed as simple linear rank statistics. For example,
suppose that we have a random sample of n observations with corresponding ranks {Ri }
and we would like to test for a linear trend. In that case, we may use the simple linear
rank statistic
n
Sn = iRi
i=1

which correlates the ordered time scale with the observed ranks of the data.

46
3. Tools for Nonparametric Statistics

Table 3.1.: Two-sample Problems of location


ai

Wilcoxon i − n+1
2

Terry-Hoeffding E V(i)
i
Van der Warden Φ−1 n+1

Table 3.2.: Two-sample Problems of scale

ai
2
Mood i − n+1
2
 
Freund-Ansari-Bradley i − n+1 
2


⎪ 2i i even, 1 < i ≤ n2

⎨2i − 1 i odd, 1 ≤ i < n2
Siegel-Tukey
⎪2 (n − i) + 2 i even, n2 < i ≤ n



2 (n − i) + 1 i odd, n2 < i < n
 −1 i 2
Klotz Φ n+1
 2
Normal E V(i)

In another example, suppose we have a random sample of size n from one population
(X) and N − n from another (Y ). We are interested in testing the null hypothesis that
the two populations are the same against the alternative that they differ in location.
We may rank all the N observations together and retain only the ranks of the second
population (Y ) denoted Ri , i = n + 1, . . . , N , by choosing

0 i = 1, . . . , n
ci =
1 i = n + 1, . . . , N.

The test statistic takes the form Tn = N i=n+1 ci Ri . Properties of Tn under the null
and alternative hypotheses are given in Gibbons and Chakraborti (2011).
We may generalize this test statistic by defining functions {a (Ri ) , i = 1, . . . , N } of
the ranks and choosing
N
Tn = ci a (Ri ) .
i=n+1

47
3. Tools for Nonparametric Statistics

These functions may be chosen to reflect emphasis either on location or on scale. For
both the problems of location and scale, Tables 3.1 and 3.2 list values of the constants
for location and scale alternatives. Here, V(i) represents the ith order statistic from
a standard normal with cumulative distribution function Φ (x). We shall see in later
chapters that this results in the well-known two-sample Wilcoxon statistic.
Lemma 3.1. Suppose that the set of rankings (R1 , . . . , Rn ) are uniformly distributed

over the set of n! permutations of the integers 1,2,. . . ,n. Let Sn = ni=1 ci a (Ri ) . Then,

(a) for i = 1, . . . , n,
n+1 (n2 − 1)
E [Ri ] = , V ar (Ri ) =
2 12
and for i = j,
n+1
Cov (Ri , Rj ) = − .
12
 
(b) E [Sn ] = nc̄ā, where c̄ = ni=1 ci /n and ā = ni=1 a (i) /n.
n 2 n 2
i=1 (ci − c̄) i=1 (a (i) − ā) .
1
(c) V ar(Sn ) = n−1
n
(d) Let Tn = i=1 di b (Ri ) , for regression coefficients {di } and score function b (.).
Then
 n

Cov (Sn , Tn ) = σab (ci − c̄) di − d¯ ,
i=1
n n
where d¯ = i=1 di /n and σab = 1
n−1 i=1 (a (i) − ā) (b (i) − b̄).

Proof. The proof of this lemma is straightforward and is left as an exercise for the
reader.

Example 3.1. The simple linear rank statistic Sn = ni=1 iRi can be used for testing
the hypothesis that two continuous random variables are independent. Suppose that we
observe a random sample of pairs (Xi , Yi ) , i = 1, . . . , n. Replace the X  s by their ranks,
say R1 , . . . , Rn and the Y  s by their ranks, say Q1 , . . . , Qn . A nonparametric measure of
the correlation between the variables is the Spearman correlation

i Ri − R̄ Q1 − Q̄
ρ =  2  2
i Ri − R̄ i Q1 − Q̄

i i− 2 Ri − n+1
n+1
2
=  n+1 2

i i− 2

Under the hypothesis of independence we have from Lemma 3.1 (a),

E[ρ] = 0, V ar(ρ) = 1.

48
3. Tools for Nonparametric Statistics

Under certain conditions, linear rank statistics are asymptotically normally distributed.
We shall consider square integrable functions φ defined on the interval (0, 1) which
have the property that they can be written as the difference between two nondecreasing
functions and satisfy  1
 2
0< φ (u) − φ̄ du < ∞
0
1
where φ̄ = 0
φ (u) du. An important result for proving limit theorems is the Projection
Theorem.

Theorem 3.1 (Projection Theorem). Let T (R1 , . . . , Rn ) be a rank statistic and set

â (i, j) = E [T |Ri = j]

Then the projection of T into the family of linear rank statistics is given by

n−1
n
T̂ = â (i, Ri ) − (n − 2) E[T ].
n i=1

Proof. The proof of this theorem is given in (Hájek and Sidak (1967), p. 59).

Example 3.2 (Hájek and Sidak (1967), p. 60). Suppose that R1 , . . . , Rn are uniformly
distributed over the set of n! permutations of the integers 1, 2, . . . , n. The projection of
the Kendall statistic defined by

i
=j sgn (i − j) sgn (Ri − Rj )
τ=
n (n − 1)

into the family of linear rank statistics is given by



8 i − n+1
2
Ri − n+1
2
τ̂ =
n (n − 1)
2
 
2 n+1
= ρ. (3.1)
3 n

The proof proceeds by noting E[τ̂ ] = 0 and




⎪ 0 k = i, k = h






E [sgn (Ri − Rh ) |Rk = j] = 2j−n−1
k = i, 1 ≤ j ≤ n


n−1





⎩ n+1−2j
n−1
k = h, 1 ≤ j ≤ n

49
3. Tools for Nonparametric Statistics

and hence

1  2j − n − 1  n + 1 − 2j
E [τ̂ |Rk = j] = sgn (k − h) + sgn (i − k)
n (n − 1) h n−1 i
n−1

8 i i − n+12
Ri − n+1
2
= 2 .
n (n − 1)

The result now follows from the projection theorem.

Theorem 3.2. Suppose that R1 , . . . , Rn are uniformly distributed over the set of n!
permutations of the integers 1,2,. . . ,n. Let the score function be given by any one of the
following three functions:
 
i
a (i) = φ
n+1
 i/n
a (i) = n φ (u) du
(i−1/n)

 
a (i) = E φ Un(i)

(1) (n)
where φ is a square integrable function on (0, 1) and Un < . . . < Un are the order
statistics from a sample of n uniformly distributed random variables. Let


n
Sn = ci a (Ri ) .
i=1

Then, provided the following Noether condition holds


n
(ci − c̄)2
i=1
−→ ∞,
maxi≤n (ci − c̄)2
as n −→ ∞,
Sn − nc̄ā L

→ N (0, 1)
σn
where

1  
n
σn2 = (ci − c̄)2 (ai − ā)2
n − 1 i=1
  1
n
2  2
≈ (ci − c̄) φ (u) − φ̄ du.
i=1 0

50
3. Tools for Nonparametric Statistics

Proof. The details of the proof of this theorem, given in (Hájek and Sidak (1967), p. 160),
+ Ri ,
consist in showing that the n+1 behave asymptotically like a random sample of uni-
formly distributed random variables and then that Sn is equivalent to a sum of indepen-
dent random variables to which we can apply the Lindeberg-Feller central limit theorem
(Theorem 2.4).

The asymptotic distribution of various linear rank statistics under the alternative
was obtained using the concept of contiguity (Hájek and Sidak (1967) which we further
describe in Chapter 9 when we discuss efficiency. See also (Hajek (1968)) for a more
general result.

3.2. U Statistics
The theory of U statistics was initiated and developed by Hoeffding (1948) who used
it to study global properties of a distribution function. Let {X1 , . . . , Xn } be a random
sample from some distribution F. Let h (x1 , . . . , xm ) be a real valued measurable function
symmetric in its arguments such that

E [h (X1 , . . . , Xm )] = θF .

The smallest integer m for which h (X1 , . . . , Xm ) is an unbiased estimate of θF called


the degree of θF .

Definition 3.2. A U statistic for a random sample of size n ≥ m is defined to be



Un = (nm )−1 h (Xi1 , . . . , Xim ) ,

where the summation is taken over all combinations of m integers (i1 < i2 < . . . < im )
chosen from (1, 2, . . . , n).
An important property of a U statistic is that it is a minimum variance unbiased
estimator of θF . There are numerous examples of U statistics which include the moments
of a distribution, the variance, and the serial correlation coefficient. We present some
below.

Example 3.3. (i) The moments of a distribution are given by the choice h (x) = xr .
(xi −xj )2
(ii) The sample variance is obtained from the choice of h (xi , xj ) = 2
from which
we get
 (xi − xj )2
Sn2 = (n2 )−1 , n > 1.
i<j
2

51
3. Tools for Nonparametric Statistics

(iii) Let (xi , yi ) be a sequence of n pairs of real numbers and construct for each coordi-
nate, the n (n − 1) signs of differences {sgn (xi − xj ) , i = j} {sgn (yi − yj ) , i = j}.
Define the kernel

h ((xi xj ) , (yi , yj )) = sgn (xi − xj ) sgn (yi − yj ) .

Then the U statistic


1 
[sgn (xi − xj ) sgn (yi − yj )]
n (n − 1) i
=j

with the sum extending over all possible pairs of indices is the covariance between
the signs of the differences between the two sets. This is the Kendall statistic often
used for measuring correlation.

(iv) Gini’s mean difference statistic for a set of n real numbers {xi } given by

1 
|xi − xj |
n (n − 1) i
=j

is a U statistic which has seen many applications in economics. It measures the


spread of a distribution.

(v) The serial coefficient for a set of n real numbers {xi } given by


n−1
xi xi+1
i=1

is a U statistic.
We may obtain a general expression for the variance of a U statistic. Denote
for c = 0, 1, . . . , m, the conditional expectation of h (X1 , . . . , Xm ) given (X1 = x1 , ..Xc
= xc , Xc+1 , . . . , Xm ) by

hc (x1 , . . . , xc ) = E [h (x1 , . . . , xc , Xc+1 , . . . , Xm )] .

Theorem 3.3. The variance of a U statistic is given by


m
n−m
V ar (Un ) = (nm )−1 (m
c ) m−c σc2 ,
c=1

where σc2 = V ar [hc (X1 , . . . , Xc )]. Moreover, the variances are nondecreasing:

σ12 ≤ σ22 ≤ . . . ≤ σm
2

52
3. Tools for Nonparametric Statistics

2
and for large n, if σm < ∞,
V ar (Un ) ∼
= m2 σ12 /n
Proof. See Ferguson (1996).

Definition 3.3. A U statistic has a degeneracy of order k if σi2 = 0 for i ≤ k and


2
σk+1 > 0.

Example 3.4. Let X1 , . . . , Xn be a random sample for which E [X] = 0. Let the kernel
function be defined as the product

h (x1 , x2 ) = x1 x2 .

Then,
E [Un ] = 0
and
h1 (x) = E [h (X1 , X2 ) |X1 = x] = 0
This implies σ12 = 0 and hence we have a degeneracy of order 1.
An important property of U statistics is that under certain conditions, they have
limiting distributions as the next theorem states.
2
Theorem 3.4. Let σm < ∞.

(i) Suppose that σ12 > 0. Then for large n


√ L
n (Un − θF ) −
→ N 0, m2 σ12 .

(ii) If the U statistic has a degeneracy of order 1, then

L



n (Un − θF ) −
→ λj Zj2 − 1 ,
1

where the {Zj } are i.i.d. N (0, 1) and the {λj } are the eigenvalues satisfying

h (x1 , x2 ) − θF = λj ϕk (x1 ) ϕk (x2 )

for orthonormal functions {ϕk (x)} for which

E [ϕk (X) ϕj (X)] = δkj .

Proof. See Ferguson (1996).

It is also possible to define a two-sample version of a U statistic.

53
3. Tools for Nonparametric Statistics

Definition 3.4. Consider two independent random samples X1 , . . . , Xn1 from F and
Y1 , . . . , Yn2 from G. Let h (x1 , . . . , xm1 ; y1 , . . . , ym2 ) be a kernel function symmetric in
the x s and separately symmetric in the y  s with finite expectation

E [h (X1 , . . . , Xm1 ; Y1 , . . . , Ym2 )] = θF,G .

A two-sample U statistic is defined to be


−1 n2 −1 
Un1 ,n2 = nm11 m2 h Xi1 , . . . , Xim1 ; Y, . . . , Yjm2 ,

where the sum is taken over all subscripts 1 ≤ i1 < . . . < im1 ≤ n1 chosen from
1, 2, . . . , n1 and subscripts 1 ≤ j < . . . < jm2 ≤ n2 chosen from 1, 2, . . . , n2 respectively.
In analogy with the one-sample case, define for i, j, the conditional expectation of
h (X1 , . . . , Xm1 ; Y1 , . . . , Ym2 ) given
(X1 = x1 , ..Xi = xi , Xi+1 , . . . , Xm1 ) and (Y1 = y1 , ..Yj = yj , Yj=1 , . . . , Ym2 ) by

hij (x1 , . . . , xi ; y1 , . . . , yj ) = E [h (x1 , . . . , xi , Xi+1 , . . . , Xm1 ; y1 , . . . , yj , Yj+1 , . . . , Ym2 )] .

Theorem 3.5. The variance of the U statistic Un1 ,n2 is given by

n1 −1 n2 −1 
m1 
m2 n  n2 −m2 
1 −m1
V ar (Un1 ,n2 ) = m1 m2 (m i
i ) m1 −i
m2
j m2 −j σij2 ,
i=1 j=1

where σij2 = V ar [hij (X1 , . . . , Xi ; Y1 , . . . , Yj )].


2
If σm 1 m2
< ∞, and n1n+n 1
2
→ λ, 0 < λ < 1 as n1 , n2 → ∞, then
√ L
n1 + n2 (Un1 ,n2 − θF,G ) −
→ N 0, σ 2 ,

where
m21 2 m22 2
σ2 =
σ10 + σ .
λ 1 − λ 01
Proof. See (Bhattacharya et al. (2016), or Lehmann (1975), p. 364) for a slight variation
of this result. The proof as in the one-sample case is based on a projection argument.
Example 3.5. Let X1 , . . . , Xn1 and Y1 , . . . , Yn2 be two independent random samples
from distributions F (x) and G (y) respectively. Let h (X, Y ) be a two-sample kernel
function and let the corresponding U statistic be given by

1 
n1 n2
Un1 ,n2 = h (Xi , Yj ) .
n1 n2 i=1 j=1

54
3. Tools for Nonparametric Statistics

Set m1 = m2 = 1 in Theorem 3.5 and let h10 (x) = E [h (x, Y )] and h01 (y) = E [h (X, y)].
Then, as n1 + n2 → ∞, with n1n+n
1
2
→ λ,
√ L
n1 + n2 (Un1 ,n2 − θ) −
→ N 0, σ 2 ,

where
2
σ10 σ2
σ2 = + 01
λ 1−λ
and
2 2
σ10 = V ar [h10 (X)] , σ01 = V ar [h01 (Y )] .

An immediate application of Example 3.5 will be seen in Theorem 5.1 when we


consider the Wilcoxon two-sample statistic.
U statistics may arise as we shall see when using composite likelihood. Consider a
density defined for a kernel T = h (X1 , . . . , Xr )

f (t; θ) = exp [θ t − K(θ)] gT (t),

where gT (t) is the density of T under the null hypothesis H0 : θ = 0.


For a sample of size n, we may construct the composite log likelihood function which
is proportional to  
l(θ) ∼ θ h(xi1 , . . . , xir ) − (nr )K(θ) , (3.2)

where the summation is over all possible subsets of r indices chosen (1, . . . , n). It is then
seen that
(θ) ∼ [θ Un (x1 , . . . , xn ) − K(θ)] .
The log likelihood obtained in (3.2) will be shown to lead to well-known test statistics.
Remark. A modern detailed account of multivariate U statistics may be found in Lee
(1990) and Gotze (1987).

3.3. Hoeffding’s Combinatorial Central Limit


Theorem
Let (Rn1 , . . . , Rnn ) be a random vector which takes the n! permutations of (1, . . . , n)
with equal probabilities 1/n!. Set


n
Sn = an (i, Rni ) ,
i=1

55
3. Tools for Nonparametric Statistics

where an (i, j) , i, j = 1, . . . , n are n2 real numbers. Let

1 1 1 
n n n n
dn (i, j) = an (i, j) − an (g, j) − an (i, h) + 2 an (g, h) .
n g=1 n h=1 n g=1 h=1

Theorem 3.6 (Hoeffding (1951a)). Then the distribution of Sn is asymptotically normal


with mean
1 
E[Sn ] = an (i, j)
n i j

and variance
1  2
V ar(Sn ) = d (i, j)
n−1 i j n

provided  
1
n i j drn (i, j)
lim   r2 = 0, r = 3, 4, . . . (3.3)
n→∞ 1  
2
n i j dn (i, j)

Equation (3.3) is satisfied if

max1≤i,j≤n d2n (i, j)


lim     = 0.
n→∞ 1 2 (i, j)
n i d
j n

In the special case, an (i, j) = an (i) bn (j), equation (3.3) is satisfied if


2
max (an (i) − ān )2 max bn (i) − b̄n
lim n   2 = 0,
n→∞ (an (i) − ān )2 bn (i) − b̄n

where
1 1
ān = an (i) , b̄n = b (i) .
n i n i

3.4. Exercises
2(2n+5)
Exercise 3.1 (Hájek and Sidak (1967), p. 81). Show that V ar(τ ) = 9n(n−1) .
Hint: Define the Kendall statistics τn and τn−1 based on (X1 , . . . , Xn ) and (X1 , . . . , Xn−1 )
respectively. Using the fact that

1
Ri = 1 + sgn (Xi − Xj ) ,
2 j
=i

56
3. Tools for Nonparametric Statistics

show that  
n−2 4 n+1
τn = τn−1 + Rn −
n n (n − 1) 2
from which we obtain the telescopic recursion for hi = [i (i − 1)]2 V ar(τi ) and

4 2
hi − hi−1 = i −1 .
3
Hence,

n
hn = (hi − hi−1 ) .
i=1

Exercise 3.2. Suppose that R1 , . . . , Rn are uniformly distributed over the set of n!
permutations of the integers 1, 2, . . . , n. Show that the conditions of Theorem 3.5 are

satisfied for the statistic Sn = ni=1 iRi .

Exercise 3.3. Apply the projection method to the sample variance for a random sample
of size n

1  2
n
Sn2 = Xi − X̄
n − 1 i=1
  (Xi − Xj )2
= (n2 )−1
i<j
2

to show that, properly standardized, it is asymptotically normal provided E[X 4 ] < ∞.


Hint: Show that
⎧  (x−X )2 

⎪ ⎪ E k
i = j, k
2 ⎨ 2
(Xi − Xj ) 
E  Xi = x =
2  ⎪
⎪  
⎩E (Xj −Xk )2 otherwise
2

Exercise 3.4.
n+1 2
(a) Show that V ar(τ̂ ) = 4
9 n
1
n−1
. Hence, as n → ∞,

V ar(τ̂ )
→ 1.
V ar(τ )

(b) Show that this implies that the Kendall and Spearman statistics are asymptotically
equivalent.

57
3. Tools for Nonparametric Statistics

Exercise 3.5. Let X1 , . . . , Xn1 and Y1 , . . . , Yn2 be two independent random samples
from distributions F (x) and G (y), respectively. Let M be the number of pairs (Xi , Yj )
whereby Xi < Yj and let W represent the sum of the ranks of the Y ’s in the combined
samples. Show that
n2 (n2 + 1)
W =M+ .
2
Hint: For any j, 
I (Xi < Yj ) + j = Rank (Yj )
i

Exercise 3.6. In Example 3.5, find the projection of Un1 ,n2 onto the space of linear rank
statistics for the X  s and for the Y  s.

Exercise 3.7 (Randles and Wolfe (1979)). Find the mean and variance of the statistic


n  
2 Ri
Sn = i log
i=1
n+1

and show that as n → ∞, it has an asymptotic normal distribution.

Exercise 3.8. Consider the Spearman Footrule distance between two permutations μ, ν
of length n defined as

n
d (μ, ν) = |μ (i) − ν (i)|
i=1

It was shown in Diaconis and Graham (1977) that when μ, ν are independent and are
uniformly distributed over the integers 1, 2, . . . , n

n2
E [d (μ, ν)] = + O (n)
3
2n3
V ar [d (μ, ν)] = + O n2 .
45
Use Hoeffding’s combinatorial central limit theorem with an (i, j) = |i − j| to show that
d (μ, ν) is asymptotically normal.

Exercise 3.9. An alternate form of the projection Theorem 3.1 (see van der Vaart
(2007), p. 176) is as follows.
Let R = (R1 , . . . , RN ) be the ranks of an i.i.d. sample from a uniform distribution
U1 , . . . < UN on (0, 1). Let
 
(i)
a (i) = E φ UN

58
3. Tools for Nonparametric Statistics

where φ is a square integrable function on (0, 1) . Let


N
S̃N = N āN c̄N + (ci − c̄N ) φ (F (Xi ))
i=1

and

N
SN = ci E [φ (Ui ) |R] .
i=1

Then the projection of S̃N onto the space of linear rank statistics is SN in the sense that
 
E [SN ] = E S̃N

and  
V ar SN − S̃N
  → 1, as N → ∞.
V ar S̃N

Use the above result to show that the Wilcoxon two-sample statistic defined in
Section 3.1
1 
N
TN = Ri
N i=m+1

is asymptotically equivalent to
 
1 m n
−n F (Xi ) + m F (Yi ) .
N i=1 j=1

59
Part II.

Nonparametric Statistical Methods

61
4. Smooth Goodness of Fit Tests
Goodness of fit problems have had a long history dating back to Pearson (1900). Such
problems are concerned with testing whether or not a set of observed data emanate from
a specified distribution. For example, suppose we would like to test the hypothesis that
a set of n observations come from a standard normal distribution. Pearson proposed to
first divide the interval (−∞, ∞) into d subintervals and then calculate the statistic


d
(oi − ei )2
2
X =
i=1
ei

where oi and ei represent the number of observed and expected observations appearing
in the ith interval. The expected values {ei } are calculated as

ei = npi ,

where pi is the probability that a standard normal random variable falls in the ith
interval. The test rejects the null hypothesis that the data come from a standard normal
distribution whenever X 2 ≥ χ2d−1 (α) where χ2d−1 (α) represents the upper 100 (1 − α) %
point of a chi-square distribution with (d − 1) degrees of freedom.
Apart from having to specify the number and the choice of subintervals, one of
the drawbacks of the Pearson test is that the alternative hypothesis is vague leaving
the researcher in a quandary if in fact the test rejects the null hypothesis. Similarly,
the usual tests for goodness of fit proposed by Kolmogorov-Smirnov and Cramér von
Mises are omnibus tests that lack power when the alternatives specify departures in
location, scale, skewness, or kurtosis. Neyman (1937) in an attempt to deal with those
issues, reconsidered the problem by embedding it into a more general framework. This is
perhaps the first application of what has become known as exponential tilting whereby
the density specified by the null hypothesis is exponentially tilted to provide a density
under the alternative. Moreover, that transition from the null to the alternative occurs
in a smooth manner.

© Springer Nature Switzerland AG 2018 63


M. Alvo, P. L. H. Yu, A Parametric Approach to Nonparametric Statistics,
Springer Series in the Data Sciences, https://doi.org/10.1007/978-3-319-94153-0_4
4. Smooth Goodness of Fit Tests

4.1. Motivation for the Smooth Model


The distribution for the smooth model to be defined in connection with the goodness
of fit problem can be motivated as follows. Suppose that one would like to estimate
the mean μ of a population with a confidence interval. Suppose that we observe a
set of L values x1 , . . . , xL of a random variable X. We may estimate μ by a weighted

estimate using a set of weights {wi } satisfying wi ≥ 0, wi = 1 and on which there is
a distribution 
μ̂ = w i xi .
The Kullback-Leibler information number for choosing between two distributions of
weights w and w0 is defined to be


L
D (w||wo ) = wi log (wi /w0i ) ,
i=1

where w represents the true distribution. The Kullback-Leibler measure is not a met-
ric since it does not satisfy the metric properties of symmetry and triangle inequality.
However, it satisfies the Gibbs inequality D (w||w0 ) ≥ 0. This follows from the fact that
− log x is a strictly convex function and hence
 
− wi log (w0i /wi ) ≥ − log wi (w0i /wi ) = 0.
1
Let w0 = L
, . . . , L1 . Minimizing with respect to the {wi } the Lagrange multiplier
expression    
D (w||w0 ) + θ μ − w i xi + λ 1 − wi

leads to the choice

eθxi
wi =  θxi
e

eθ(xi −μ̂)
=  θ(xi −μ̂) .
e

There is an interpretation for the parameter θ (Efron, 1981) as follows. Since the {wi }
determine an estimate μ̂, we may interpret μ̂ as indexed by θ as a contender for being
in a confidence interval for μ. Hence, varying θ leads to different estimates as one would
find in a confidence interval.
Suppose now that we fix the {xi } and resample them n times using weights {wi }
so that
P (X = xi ) = wi .

64
4. Smooth Goodness of Fit Tests
L
Let ni be the number of occurrences of xi with i=1 ni = n. Then the bootstrap
distribution corresponds to the multinomial distribution

n! "
P (n1 , . . . , nk ) = wini
n1 ! . . . nL !
n! " eθni (xi −μ̂)
=  n
n1 ! . . . nL ! ( eθ(xi −μ̂) ) i

n! eθ ni (xi −μ̂)
=  θ(x −μ̂) n
n1 ! . . . nL ! ( e i )
n!  ∗
= eθ ni (xi −μ̂)−nK (θ)
n1 ! . . . nL !

where  
K ∗ (θ) = log eθ(xi −μ̂) .

The non-bootstrapped distribution is given by

n! "  1 ni
P0 (n1 , . . . , nk ) =
n1 ! . . . n L ! L
 n
n! 1
=
n1 ! . . . n L ! L

so that the bootstrapped distribution appears as an exponentially tilted distribution

P (n1 , . . . , nL ) 
= eθ ni (xi −μ̂)−nK(θ)
P0 (n1 , . . . , nL )

= eθn(μ −μ̂)−nK(θ)

where  
1  θ(xi −μ̂)
K (θ) = log e
L
and  ni 
μ∗ = xi
n
is the bootstrapped value. Consequently, the bootstrapped distribution of μ∗ is centered
at the observed mean. These considerations have shown that an exponentially tilted
distribution arises in a natural way in an estimation context.

65
4. Smooth Goodness of Fit Tests

4.2. Neyman’s Smooth Tests


Suppose that we are presented with a random sample X1 , . . . , Xn from a continuous
strictly increasing cdf F (x) with continuous density f (x) and that we would like to test
the null hypothesis
H0 : F (x) = F0 (x) , for all x
for known F0 against the alternative

H1 : F (x) = F0 (x) , for some x.

Using the probability transformation Y = F0 (X), we see that under H0 , Y1 , . . . , Yn is a


random sample from the uniform distribution on the interval [0, 1] and we may instead
test for uniformity for the sample F (X1 ) , . . . , F (Xn ). If the alternative were specified
more precisely, one could use the Neyman-Pearson lemma to arrive at a uniformly most
powerful test. More often than not, however, the alternative is vague. Various tests such
as the Kolmogorov-Smirnov and Anderson-Darling test statistics have been proposed
for this problem (Serfling, 2009). In most cases, however, such tests have little power
and moreover, if the null hypothesis of uniformity is rejected, it is not clear what the
alternative hypothesis should be.
Neyman (1937) considered the problem of testing for goodness of fit by embedding
the uniform distribution into a larger class of alternatives defined by
 d 

π (x; θ) = exp θj hj (x) − K (θ) , 0 < x < 1 (4.1)
j=1

= exp (θ h (x)) − K (θ) ,

where
h (x) = (h1 (x) , . . . , hd (x))
and K (θ) is the normalizing constant depending on θ, chosen to make π (x, θ) a proper
density and the {hj (x)} consist of orthonormal polynomials that satisfy
 1
hi (x) dx = 0 (4.2)

0

1
0 i = j
hi (x) hj (x) dx = δij = (4.3)
0 1 i = j.

The orthonormal polynomials may be selected in order to detect different alternatives


and these lead to different tests. Corresponding to various underlying densities, some
common choices of orthogonal functions (denoted with an upper asterisk) are listed in
Table 4.1.

66
Table 4.1.: Orthogonal functions for various densities

Density Name Orthogonal function


2
 [j/2] (−1/2)t j! j−2t
Normal: √1 exp

− x2 Hermite polynomials h∗j (x) = t=0 t!(j−2t)!
x

−λ x
- 
Poisson: exp λ /x! Poisson-Charlier h∗j (x; λ) = λj /j! jt=0 (−1)j−t (xt )!λ−t jt
j j

67
Exponential: e−x h∗j (x) = t=0 t (−x)t /t!

ex d j
Gamma: xα−1 exp(−x/β)
Γ(α)β α
Laguerre polynomials h∗j (x) = j! dxj
(e−x xj )
−M
(Mx )(Nn−x ) 2x
Hypergeometric N Chebyshev polynomials h0 (x, n) = 1, h1 (x, n) = 1 − n
,x = 0, 1, . . . , n
(n)
4. Smooth Goodness of Fit Tests

(i + 1)(n − i)hi+1 (x, n) =


(2i + 1)(n − 2x)hi (x, n) − i(n + i + 1)hi−1 (x, n).
4. Smooth Goodness of Fit Tests

Under the model in (4.1), the test for uniformity then becomes a test of

H0 : θ = 0 vs H1 : θ = 0. (4.4)

For small values of θj close to 0, we see that π (x; θ) is close to the uniform density.
Hence, the model in (4.1) provides a “smooth” transition from the null hypothesis and
as we shall later see, leads to tests with optimal properties. We note that under this
formulation, the original nonparametric problem has been cast as a parametric one. The
next theorem specifies the test statistic.

Theorem 4.1. Let X1 , . . . , Xn be a random sample from (4.1). The score test for testing
(4.4) rejects whenever
d
Uj2 > cα ,
j=1

where
1 
n
Uj = √ hj (Xi )
n i=1

and cα satisfies P (χ2d > cα ) = α.

Proof. The log likelihood for the density in (4.1) is


d 
n
l (θ; x) = θj hj (xi ) − nK (θ)
j=1 i=1

which yields the score vector

∂l (θ; X)
U (θ; X) =
∂θ
with jth component
∂l (θ; x) √ ∂K (θ)
= nuj − n
∂θj ∂θj
Differentiating with respect to θj and evaluating the derivative at θ = 0:
 1

π (x; θ) dx = 1
∂θj 0

leads to
∂K (θ)
=0
∂θj

68
4. Smooth Goodness of Fit Tests

in view of (4.2) and consequently

∂l (θ; x) √
= nuj .
∂θj

Also, the (i, j) component of the information matrix In (θ) in view of (4.3) evaluated
at θ = 0 is
∂ 2 K (θ)
= nδij .
∂θi ∂θj
It follows that the score test statistic at θ = 0 is

d

U (θ; X) In−1 (θ) U (θ; X) = Uj2 .
j=1

Lemma 2.5 then yields the asymptotic distribution of the score statistic. An alternative
direct proof makes use of the fact that the Ui s are sums of i.i.d. random variables.

The change of measure or exponential tilting model introduced in (4.1) has been
used in rare event simulation (Asmussen et al., 2016) as well as in rejection sampling
and importance sampling. We illustrate the latter use in the following example.

Example 4.1 (Importance Sampling). Suppose that X is a random variable having


a normal distribution with mean 0 and variance 1 and we would like to estimate the
p-value
 ∞  
1 x2
p = P (X > c) = √ exp − 2 dx
c 2πσ 2σ
≡ E1 [I (X > c)]

where c > 0 and I (.) is the indicator function. This can be done in one of two ways.
In the first case, we may take a random sample of size n and calculate the unbiased
estimator
1
n
p̂ = I (Xi > c)
n i=1
whose variance is equal to
p (1 − p) 1+ ,
= E [I (X > c)] − p2 . (4.5)
n n
Alternatively, we note that
  

1 (x − c)2
p = Λ (x) √ exp − dx
c 2πσ 2σ 2
≡ E2 [Λ (X) I (X > c)]

69
4. Smooth Goodness of Fit Tests

where
2
exp − 2σ
√1
2πσ
x
2
Λ (x) = 
(x−c)2
√ 1 exp −
2πσ 2σ 2
 
1
= exp − 2 c (2x − c)

is the likelihood ratio. We may now take a random sample of size n and calculate
 the
(x−c)2
unbiased estimator with respect to the change of measure √2πσ exp − 2σ2
1
:

1
n
p̂c = Λ (Xi ) I (Xi > c)
n i=1

whose variance is 1+   ,
E2 Λ (X)2 I (X > c) − p2 . (4.6)
n
Since on the set I (X > c)
Λ (X) ≤ 1
then the inequality
 
E2 Λ (X)2 I (X > c) ≤ E2 [Λ (X) I (X > c)] = E1 [I (X > c)]

shows that the variance (4.6) of the second estimator will not exceed that of the first
in (4.5). See Siegmund (1976) for further discussion on importance sampling and its ap-
plication to the calculation of error probabilities connected to the sequential probability
ratio test.

In this book, we shall make use of exponential tilting to describe many common
nonparametric statistics. Setting

π (x; θ) = exp (θx − K (θ)) gX (x)

we record in Table 4.2 various examples of the change of measure for different densities
gX (x) where the latter is determined from π (x; θ) under θ = 0. In all cases, the normal-
izing constant K (θ) is the cumulant or log of the moment generating function computed
under gX (x)

K (θ) = log (E [exp (θX)]) .


We note that K (θ) is a strictly convex function such that

Eθ [X] = K  (θ) , V arθ (X) = K  (θ) .

70
Table 4.2.: Examples of change of measure distributions
Density gX (x) π (x; θ) K (θ)
θ2 σ2
N (μ, σ 2 ) N (μ + θσ 2 , σ 2 ) 2
+ θμ

θ
β exp (−βx),x > 0, β > 0 (β − θ) exp (− (β − θ) x) , θ < β − log 1 − β
n n peθ
p
px (1 − p)n−x , x = 0, . . . , n p
pxθ (1 − pθ )n−x , x = 0, . . . , n,0 < pθ < 1, pθ = − log (1 − pθ )
1 − p + peθ
e−μ μx e−μθ μx

71
θ
x!
, μ > 0, x = 0, 1, . . . x!
, μθ = μeθ , x = 0, 1, . . . μ eθ − 1 
β α (β−θ)α α−1 −(β−θ)x
Γ (α)
xα−1 e−βx ,α, β > 0 Γ (α)
x e ,α, β − θ > 0 −α log 1 − βθ
θ2 Σ
N (μ, Σ) N (μ + Σθ, Σ) 2
+ θμ
r/2
1 (1/2−θ) 1
2r/2 Γ (r/2)
xr/2−1 e−x/2 , x, r >0 Γ (r/2)
xr/2−1 e−(1/2−θ)x ,r > 0, θ < 1/2 (−r/2) log (1 − 2θ) , θ < 2
4. Smooth Goodness of Fit Tests

 
peθ
(1 − p)x−1 p, x = 1, 2, . . . ; p > 0 (1 − pθ )x−1 pθ , x = 1, 2, . . . ; pθ = (1 − θ − log (1 − p)) > 0 log
1 − (1 − p) eθ
4. Smooth Goodness of Fit Tests

Example 4.2. Suppose that we are given a random sample of size n from a normal
distribution with mean μ and variance σ 2 . We would like to test a null hypothesis on
the mean μ. Under the smooth alternative formulation,

2
1 (x − μ − θσ 2 )
π (x; θ) = √ exp − .
2πσ 2σ 2

Based on a random sample of size n, the log of the likelihood function as a function of
θ is proportional to  n 2 2
nθ (x̄n − μ) − σ θ ,
2
where x̄n is the sample mean. Setting the derivative of the log with respect to θ equal
to 0 shows that the maximum likelihood estimate of θ is
(x̄n − μ)
θ̂ = .
σ2

Consequently, θ̂ represents the shift of the sample mean from the null hypothesis mean
μ0 as specified by the density gX (x). The score statistic is given by the derivative of the
log likelihood evaluated at θ = 0:
n [x̄n − μ] .

Moreover, from the above equation, the information function is given by K  (0) =
nσ 2 . Hence the Rao score statistic from Theorem 2.7 becomes
  −1  
W = n2 X̄n − μ (K  (0)) X̄n − μ (4.7)
n  2
= 2
X̄n − μ . (4.8)
σ
The null hypothesis is rejected for large values of W which has asymptotically as n → ∞,
a chi-square distribution with one degree of freedom. The advantage of the Rao score
test statistic (4.7) is that all the derivatives are computed under θ = 0. Asymptotically, it
is equivalent to both the generalized likelihood ratio test and the Wald test as indicated
in Theorem 2.7. It is seen then that (4.8) is the usual two-sided test statistic for the
mean of a normal. We may generalize the previous sections to the case of composite
hypotheses.

4.2.1. Smooth Models for Discrete Distributions


In this section, we describe some orthonormal expansions for two discrete distributions:
the hypergeometric and the binomial. It is somewhat analogous to the smooth alterna-
tive model described in (4.1) where we invoke the use of Chebyshev polynomials. The
Chebyshev polynomials defined over the integers x = 0, 1, . . . , n can be expressed in

72
4. Smooth Goodness of Fit Tests

descending factorial form as


i     −1
i i+m
m x n
hi (x, n) = (−1) , (4.9)
m=0
m m m m

where 0 ≤ i ≤ n (Ralston (1965), p. 238). There is also an ascending factorial form. For
these polynomials we have the recursion

2x
h0 (x, n) = 1, h1 (x, n) = 1 − ,
n

(i + 1)(n − i)hi+1 (x, n) = (2i + 1)(n − 2x)hi (x, n) − i(n + i + 1)hi−1 (x, n).

For every n, the vectors

εi = (hi (0, n), . . . , hi (n, n)) for i = 0, 1, . . . , n.


form a basis in (n + 1)-dimensional space. Hence, any function {g(x)} defined over the
integers x = 0, . . . , n can be expressed in vector notation


n
g= g i εi
i=0

where the vector g = (g(0), g(1), . . . , g(n)) and the gi are obtained from the relation

g  εi
gi = .
||εi ||2
Alternatively, for each x


n
g(x) = gi hi (x, n) for x = 0, 1, . . . , n.
i=0

Example 4.3 (Hypergeometric Distribution). The hypergeometric distribution is


given by
M N −M
p(x; M, N ) = x
Nn−x
for max(0, n − N + M ) ≤ x ≤ min (n, M ) .
n

The connection between the Chebyshev polynomials and the hypergeometric distribution
is given by the following theorem proven in Alvo and Cabilio (2000).

Theorem 4.2. The expected value of a function of a hypergeometric variable with pa-
rameters (M, N, n) is equal to a linear combination of the first n Chebyshev polynomials

73
4. Smooth Goodness of Fit Tests

in parameters M, N. as

n
E [g(X)] = gi hi (M, N ).
i=0

Proof. The proof uses the representation in (4.9) and proceeds by showing that


n
hi (x, n)p(x; M, N ) = hi (M, N ) for all i = 0, 1, . . . , n, and M = 0, 1, . . . , N.
x=0

A consequence of Theorem 4.2 is that if g (x) = I (X = x), then


n
p(x; M, N ) = gi hi (M, N ). (4.10)
i=0

The R package hypersampleplan with subroutine hypergeotable (N, n) can be used to


compute tables of hypergeometric probabilities simultaneously for all values of

x = 0, 1, . . . , n and M = 0, 1, . . . , N.

Alvo and Cabilio (2000) discuss similar results to the binomial distribution.

Example 4.4 (Binomial Distribution). The distribution of a binomial random variable


X may be expressed in terms of (4.1) as

b(x; n, ρ) = (nx ) ρx (1 − ρ)n−x , x = 0, 1, . . . , n; 0 < ρ < 1.

= (nx ) exp (θx − K (θ))

where  
ρ
θ = log , K (θ) = n log 1 + eθ
1−ρ
Here our interest would be in testing the null hypothesis that ρ = 0.5 corresponding
to θ = 0.

4.2.2. Smooth Models for Composite Hypotheses


Suppose that we wish to test the hypothesis that a random sample of size n arises
from a given probability density g (x; β) where β is a q-dimensional vector of nuisance
parameters. In the continuous case, we may embed this in the following parametric

74
4. Smooth Goodness of Fit Tests

model


k
f (x; θ, β) = exp θi hi (x; β) − K (θ, β) g (x; β) ,
i=1

where K (θ, β) is a normalizing constant, θ is a k-dimensional vector of parameters, and


{hi (x; β)} are a set of orthonormal functions with respect to g (x; β) . That is to say,
they must satisfy
 ∞
hi (x; β) g (x; β) dx = 0
−∞
 ∞
hi (x; β) hj (x; β) g (x; β) dx = δij .
−∞

It is seen that since f (x; θ, β) is a proper density, K (θ, β) must satisfy

∂K (θ, β)
= Eθ [hi (X; β)]
∂θi

and
∂K 2 (θ, β)
= Covθ [hi (X; β) , hj (X; β)] .
∂θi ∂θj
The score test statistic is then given by
  −1 
S β̂ = U  β̂ Σ̂ U β̂ ,

where β̂ is the maximum likelihood estimate of β and U β̂ has rth element
n  
i=1 h j X i ; β̂ . Here Σ̂ is the estimated asymptotic covariance matrix of U β̂ .
We refer the reader to (Rayner et al. (2009b), p. 100) for further details.

Example 4.5. Suppose that we have a random sample X1 , . . . , Xn from a normal distri-

bution with mean μ and variance σ 2 ; set β = (μ, σ 2 ) . A smooth test for the hypothesis

H0 : μ = μ0

against
H1 : μ = μ0
can be constructed using the normalized Hermite polynomials from Table 4.1 which
satisfy
 ∞
x2
hi (x) e− 2 dx = 0,
 ∞ −∞ √
x2
hi (x) hj (x) e− 2 dx = 2πδrs .
−∞

75
4. Smooth Goodness of Fit Tests
+ ,
The Hermite polynomials with respect to the distribution of X are given by hi x−μ
σ
.
The maximum likelihood estimates of β are
  2 
Xi − X̄n
β̂ = X̄n , .
n

We note that
       2

x−μ x−μ x−μ 1 x−μ


h1 = , h2 =√ −1
σ σ σ 2 σ

and consequently for j = 1, 2,


⎛ ⎞

n
⎜ Xi − X̄n ⎟
hj ⎜
⎝1
⎟ = 0.
2 ⎠
i=1 (Xi −X̄n )
n

It can be shown that the score statistic for testing H0 : μ = μ0 is given by


k
V̂j2 ,
j=3

where ⎛ ⎞
1  ⎜ Xi − X̄n ⎟
n
V̂j = √ hj ⎜ 1 ⎟.
n i=1 ⎝ (Xi −X̄n )2 ⎠
n

4.3. Smooth Models for Categorical Data


The model in (4.1) can be adapted to apply to categorical data. Suppose we have m
categories and that

π (xj ; θ) = exp (θ  xj − K(θ)) pj , j = 1, . . . , m (4.11)

where xj is the jth value of a k-dimensional random vector X and p = (pj ) denotes
the vector of probabilities under the null distribution when θ = θ 0 . Here K (θ) is a
normalizing constant for which

π (xj ; θ) = 1.
j

76
4. Smooth Goodness of Fit Tests

Let T = [xi , . . . , xm ] be the k × m matrix of possible vector values of X. Then under


the distribution specified by p,
 
Σ ≡ Covp (X) = Ep (X − E [X]) (X − E [X]) (4.12)

= T (diag (p)) T  − (T p) (T p) (4.13)

where the expectation is with respect to the model (4.11). As we shall see in Chapter 5,
this particular situation arises often when dealing with the nonparametric randomized
block design. Define
π (θ) = (π (x1 ; θ) , . . . , π (xm ; θ)) .

Example 4.6. Suppose that we would like to test

H0 : θ = 0 vs H1 : θ = 0.

Letting N denote a multinomial random variable with parameters (n, π (θ)), we see
that the log likelihood as a function of θ is, apart from a constant, proportional to


m 
m
nj log (π (xj ; θ)) = nj (θ  xj − K(θ))
j=1 j=1
 

m
= θ nj xj − nK(θ)
j=1

The score vector under the null hypothesis is then given by


m  
1 ∂πj (θ)
U (θ; X) = Nj
j=1
πj (θ) ∂θ
= T (N − np)

Under the null hypothesis,


E [U (θ; X)] = 0
and the score statistic is given by

1 1 L
[T (N − np)] Σ−1 [T (N − np)] = (N − np) T  Σ−1 T (N − np) −
→ χ2r (4.14)
n n
 −1
where r = rank T Σ T .

We shall return to this example in Section 7.2 when we consider applications to


the randomized block design. In the following example, we consider the multinomial
distribution in connection with the Pearson goodness of fit statistic.

77
4. Smooth Goodness of Fit Tests

Example 4.7 (Pearson’s Goodness of Fit Statistic (Rayner et al. (2009b), p. 68)). We
shall show that the Pearson goodness of fit statistic is given by

m
(Nj − npj )2 L 2

→ χm−1 ,
j=1
npj

where pj = 1 may be obtained using the smooth model formulation.
Define the random vector x∗ as
 
∗ x
x =
1

where x is of dimension (m − 1). Consider the smooth model



π x∗j ; θ = exp θ  x∗j − K(θ) pj , j = 1, . . . , m

The matrix of possible values of x∗ is an m × m matrix


 
∗ T
T =
1m

where 1m is an m × 1 vector of ones. We note that


  
 xj pj
K (0) = 
pj

and we may prescribe the {xj } by imposing the orthonormality conditions



xj pj = 0

and
 
K  (0) = x∗j x∗j pj

= T ∗ (diag (pj )) T ∗ (4.15)
= Im

where I m is the identity matrix of order m. It follows that since


⎡ ⎤
T (diag (pj )) T  T (diag (pj )) 1m
Im = ⎣ ⎦

1m (diag (pj )) T 1m (diag (pj )) 1m

78
4. Smooth Goodness of Fit Tests

and

1m (diag (pj )) 1m = pj = 1
 
1m (diag (pj )) T  = xj pj = 0,

then
T (diag (pj )) T  = I m−1 .
Now from (4.15)

diag p−1
j = T∗ T∗
= T  T + 1m 1m .

Hence, the score statistic

1 1  
[(N − np)] T  T [(N − np)] = (N − np) diag p−1 j − 1m 1m (N − np)
n n
m
2
m
(Nj − npj )2 1 
= − (Nj − npj )
j=1
npj n j=1
m
(Nj − npj )2
= .
j=1
npj

4.4. Smooth Models for Ranking Data


In this section, we will study distance-based models within the context of smooth alter-
native distributions. We begin by describing the important Mallows models and their
generalization before proceeding to the study of cyclic models.

4.4.1. Distance-Based Models


A ranking represents the order of preference one has with respect to a set of t objects.
If we label the objects by the integers 1 to t, a ranking can then be thought of as a
permutation of the integers (1, 2, . . . , t). We may denote such a permutation by

μ = (μ(1), μ(2), . . . , μ(t))

which may also be conceptualized as a point in t-dimensional space. It is natural to


measure the spread or discrepancy between two individual rankings μ, ν by means of a
distance function.

79
4. Smooth Goodness of Fit Tests

The usual properties of a distance function between two rankings μ and ν are: (1)
reflexivity: d(μ, μ) = 0; (2) positivity: d(μ, ν) > 0 if μ = ν; and (3) symmetry:
d(μ, ν) = d(ν, μ). For rankings, we often require that the distance, apart from having
these usual properties, must be right invariant,

d(μ, ν) = d(μ ◦ τ , ν ◦ τ ), where μ ◦ τ (i) ≡ μ(τ (i)).

The requirement right invariance ensures that a relabeling of the objects has no effect on
the distance. If a distance function satisfies the triangle inequality d(μ, ν) ≤ d(μ, σ) +
d(σ, ν), the distance is said to be a metric. There are several examples of distance
functions that have been proposed in the literature. Here are a few:
Spearman:
1
t
dS (μ, ν) = (μ(i) − ν(i))2 (4.16)
2 i=1
Kendall: 
dK (μ, ν) = {1 − sgn (μ(j) − μ(i)) sgn (ν(j) − ν(i))} (4.17)
i<j

where sgn(x) is either 1 or −1 depending on whether x > 0 or x < 0.


Hamming:

t  t
dH (μ, ν) = t − I (μ(i) = j) I (ν(i) = j) (4.18)
i=1 j=1

where I(.) is the indicator function taking values 1 or 0 depending on whether the
statement in brackets holds or not.
Spearman Footrule:
t
dF (μ, ν) = |μ(i) − ν(i)| (4.19)
i=1

Cayley:
dC (μ, ν) = n − #cycles in ν ◦ μ−1
or equivalently, it is the minimum of transpositions needed to transform μ into ν. Here,
μ−1 = μ−1 (1) , . . . , μ−1 (t) denotes the inverse permutation that displays the objects
receiving a specific rank.
Note that the Spearman Footrule, Kendall, Hamming, and Cayley distances are
metrics but the Spearman distance, like the squared Euclidean distance, is not since it
does not satisfy the triangular inequality property. We shall nonetheless for convenience
refer to it as a distance function in this book. The Kendall distance counts the number
of “discordant” pairs whereas the Hamming distance counts the number of “mismatches.”
The Hamming distance has found uses in coding theory.

80
4. Smooth Goodness of Fit Tests

Let M = d μi , μj denote the matrix of all pairwise distances. If d is right
invariant, then it follows that there exists a constant c > 0 for which

M 1 = (ct!)1

where 1 = (1, 1, . . . , 1) is of dimension t!. Hence, c is equal to the average distance. It
is straightforward to show that for the Spearman and Kendall distances

t(t2 − 1) t(t − 1)
cS = , cK = .
12 2

Turning attention to the Hamming distance, we note that if e = (1, 2, . . . , t) , then

Σμ dH (μ, e) = Σμ t − Σμ Σi Σj I (μ (i) = j) I (e (i) = j.)


= t (t!) − t!

and hence cH = t − 1.

Example 4.8. Suppose that t = 3, and that the complete rankings are denoted by

μ1 = (1, 2, 3) , μ2 = (1, 3, 2) , μ3 = (2, 1, 3) , μ4 = (2, 3, 1) , μ5 = (3, 1, 2) , μ6 = (3, 2, 1)

Using the above order of the permutations, we may write the matrix M of pairwise
Spearman, Kendall, Hamming, and Footrule distances respectively as
⎛ ⎞
0 1 1 3 3 4
⎜ ⎟
⎜ 1 0 3 1 4 3 ⎟
⎜ ⎟
⎜ 1 3 0 4 1 3 ⎟
MS = ⎜ ⎟
⎜ 3 1 4 0 3 1 ⎟
⎜ ⎟
⎝ 3 4 1 3 0 1 ⎠
4 3 3 1 1 0
⎛ ⎞
0 2 2 4 4 6
⎜ ⎟
⎜ 2 0 4 2 6 4 ⎟
⎜ ⎟
⎜ 2 4 0 6 2 4 ⎟
MK = ⎜ ⎟
⎜ 4 2 6 0 4 2 ⎟
⎜ ⎟
⎝ 4 6 2 4 0 2 ⎠
6 4 4 2 2 0

81
4. Smooth Goodness of Fit Tests

⎛ ⎞
0 2 2 3 3 2
⎜ ⎟
⎜ 2 0 3 2 2 3 ⎟
⎜ ⎟
⎜ 2 3 0 2 2 3 ⎟
MH = ⎜ ⎟
⎜ 3 2 2 0 3 2 ⎟
⎜ ⎟
⎝ 3 2 2 3 0 2 ⎠
2 3 3 2 2 0
⎛ ⎞
0 2 2 4 4 4
⎜ ⎟
⎜ 2 0 4 2 4 4 ⎟
⎜ ⎟
⎜ 2 4 0 4 2 4 ⎟
MF = ⎜ ⎟
⎜ 4 2 4 0 4 2 ⎟
⎜ ⎟
⎝ 4 4 2 4 0 2 ⎠
4 4 4 2 2 0

These distances may alternatively be expressed in terms of a similarity function A


in the form
d(μ, ν) = c − A(μ, ν), (4.20)
Spearman:
t 
  
t+1 t+1
AS = AS (μ, ν) = μ(i) − ν(i) − (4.21)
i=1
2 2
Kendall: 
AK = AK (μ, ν) = sgn (μ(j) − μ(i)) sgn (ν(j) − ν(i)) . (4.22)
i<j

Hamming:

t  t    
1 1
AH (μ, ν) = I [μ(i) = j] − I [ν(i) = j] − (4.23)
i=1 j=1
t t

Footrule:    
t  t
j j
AF (μ, ν) = I [μ(i) ≤ j] − I [ν(i) ≤ j] − (4.24)
i=1 j=1
t t

The similarity measures may also be interpreted geometrically as inner products which
sets the groundwork for defining correlation (Alvo and Yu (2014)).
It is reasonable to assume that in a homogeneous population of judges, most of
the judges will have rankings close to a modal ranking μ0 . According to this frame-
work, Diaconis (1988a) developed a class of distance-based models over the set of all t!
rankings P:
e−λd(μ,μ0 )
π(μ|λ, μ0 ) = , μ ∈ P, (4.25)
C(λ)

82
4. Smooth Goodness of Fit Tests

where λ ≥ 0 is the dispersion parameter, C(λ) is a normalizing constant and d(μ, σ)


is an arbitrary right invariant distance. In the particular case where we use Kendall as
the distance function, the model is called the Mallows’ φ-model (Mallows, 1957). Note
that Mallows’ φ-models also belong to the class of paired comparison models (Critchlow
et al., 1991). Critchlow and Verducci (1992) and Feigin (1993) provided more details
about the relationship between distance-based models and paired comparison models.
Motivated from Neyman’s smooth alternative model (4.1), the distance-based model
in (4.25) can be used to test for the uniform null distribution t!1 , i.e., λ = 0.
In distance-based models, the ranking probability is largest at the modal ranking
μ0 and the probability of a ranking will decay the further it is away from the modal
ranking μ0 . The rate of the decay is governed by the parameter λ. For a small value of
λ, the distribution of rankings will be more concentrated around μ0 . When λ becomes
very large, the distribution of rankings will look more uniform. The closed form for the
normalizing constant C(λ) only exists for some distances. In principle, it can be solved
numerically by summing the value e−λd(μ,μ0 ) over all possible μ in P. This numerical
calculation could be time-consuming, as the computational time increases exponentially
with the number of objects. In fact, various methods to avoid the problem of normalizing
constant estimation have been proposed in the literature. For example, it suffices to
estimate the ratio of two normalizing constants evaluating at two consecutive iterates of
a simulated annealing algorithm, see Yu and Xu (2018). If the objective is to estimate
μ0 only, its estimation does not depend on λ and hence the computation of C(λ) is not
needed.
Given a ranking data set {μk , k = 1, . . . , n} and a known modal ranking μ0 , the
maximum likelihood estimator (MLE) λ̂ of the distance-based model can be found by
solving the following equation:

1
n
d(μk , μ0 ) = Eλ̂,σ [d(μ, μ0 )], (4.26)
n k=1

which equates the observed mean distance with the expected distance calculated under
the distance-based model in (4.24).
The MLE can be found numerically because the observed mean distance is a constant
and the expected distance is a strictly decreasing function of λ̂. For the ease of solving,
we re-parametrize λ with φ where φ = e−λ . The range of φ lies in (0, 1] and the value of
φ̂ can be obtained using the method of bisection. Critchlow (1985) suggested applying
the method with 15 iterations, which yields an error of less than 2−15 . Also, the central
limit theorem holds for the MLE λ̂, which is shown in Marden (1995).

83
4. Smooth Goodness of Fit Tests

If the modal ranking μ0 is unknown, it can be estimated by the MLE μ̂0 which
minimizes the sum of the distances over P, that is:


n
μ̂0 = argmin d(μk , μ0 ). (4.27)
μ0 ∈P
k=1

For large values of t, a global search algorithm for the MLE μ̂0 is not practical because
the number of possible rankings is too large. Instead, as suggested in Busse et al.
(2007), a local search algorithm should be used. They suggested iteratively searching

for the optimal modal ranking with the smallest sum of distances nk=1 d(μk , μ0 ) over
μ0 ∈ Π(m) , where Π(m) is the set of all rankings having a Cayley distance of 0 or 1 to
the optimal modal ranking found in the mth iteration:

(m+1)

n
μ̂0 = argmin d(μk , μ0 ).
μ0 Π(m) k=1

(0)
A reasonable choice of the initial ranking μ̂0 can be formed by ordering the mean ranks.
Recently, Yu and Xu (2018) found in their simulation that this method may cause the
(m+1)
μ̂0 stuck at a local minimum and cannot reach the global minimum. Yu and Xu
(2018) proposed to use simulated annealing, a faster algorithm to find the global solution
of the minimization problem in (4.27). Their simulation results revealed that simulated
annealing algorithm always performs better than the local search algorithm even when
the number of objects t becomes large. The local search algorithm generally performs
satisfactory for small t but its performance deteriorates heavily when t gets large, say
t ≥ 50.
Distance-based models can handle partially ranked data in several ways, with some
modifications in the distance measures. Beckett (1993) estimated the model parameters
using the EM algorithm. On the other hand, Adkins and Fligner (1998) offered a non-
iterative maximum likelihood estimation procedure for Mallows’ φ-model without using
the EM algorithm. Critchlow (1985) suggested replacing the distance metric d by the
Hausdorff metric d∗ . The Hausdorff metric between two partial rankings μ∗ and σ ∗
equals
d∗ (μ∗ , σ ∗ ) = max[ max∗ min∗ d(μ, σ), max∗ min∗ d(μ, σ)]. (4.28)
μ ∈μ σ ∈σ σ ∈ σ μ ∈μ

4.4.2. φ-Component Models


Fligner and Verducci (1986) extended the distance-based models by decomposing the
distance d(μ, σ) into (t − 1) distances,


t−1
d(μ, σ) = di (μ, σ), (4.29)
i=1

84
4. Smooth Goodness of Fit Tests

where the di (, σ)’s are independent. Note that Kendall distance and Cayley distance
can be decomposed in this form.Fligner and Verducci (1986) developed two new classes
of ranking models, called φ-component models and cyclic structure models, for the de-
composition.
Fligner and Verducci (1986) showed that the Kendall distance satisfies (4.29):


t−1
dK (μ, μ0 ) = Vi , (4.30)
i=1

where

t
Vi = I{[μ(μ−1 −1
0 (i)) − μ(μ0 (j))] > 0}. (4.31)
j=i+1

We note that Vj has the uniform distribution on the integers 0, 1, . . . , t − j (see Feller
(1968), p. 257), and V1 represents the number of adjacent transpositions required to
place the best object in μ0 in the first position, then remove this item in both μ and
μ0 , and V2 is the number of adjacent transpositions required to place the best remaining
object in μ0 in the first position of the remaining items, and so on. Therefore, the
ranking can be described as t − 1 stages, V1 to Vt−1 , where Vi = m can be interpreted
as m mistakes made in stage i.
By applying the dispersion parameter λi at stage Vi , the Mallow’s φ-model can be
extended to: t−1
e− i=1 λi Vi
π(μ|λ, μ0 ) = , (4.32)
CK (λ)
where λ = {λi , i = 1, . . . , t − 1} and the normalizing constant CK (λ) is equal to

"
t−1
1 − e−(t−i+1)λi
. (4.33)
i=1
1 − e−λi

Once again, the model in (4.31) can be expressed as



e−θ V
π(μ|λ, μ0 ) = ,
CK (λ)

where θ = (λ1 , . . . , λt−1 ) and V = (V1 , . . . , Vt−1 ) .


These models were named t − 1 parameter models in Fligner and Verducci (1986),
but were also named φ-component models in other papers (e.g., Critchlow et al., 1991).
Mallow’s φ-models are special cases of φ-component models when λ1 =. . . =λt−1 .

85
4. Smooth Goodness of Fit Tests

Based on a ranking data set {μk , k = 1, . . . , n} and a given modal ranking μ0 , the
maximum likelihood estimates λ̂i , i = 1,2, . . . ,t − 1 can be found by solving the equation

1
n
e−λ̂i (t − i + 1)e−(t−i+1)λ̂i
Vk,i = − , (4.34)
n k=1 1 − e−λ̂i 1 − e−(t−i+1)λ̂i

where

t
Vk,i = I{[μk (μ−1 −1
0 (i)) − μk (μ0 (j))] > 0}. (4.35)
j=i+1

The left- and right-hand sides of (4.33) can be interpreted as the observed mean and
theoretical mean of Vi respectively.
The extension of distance-based models to t − 1 parameters allows more flexibility
in the model, but unfortunately, the symmetric property of distance is lost. Notice here
that the so-called “distance” in φ-component models can be expressed as

λi I{[μ(μ−1 −1
0 (i)) − μ(μ0 (j))] > 0}, (4.36)
i<j

which is obviously not symmetric, and hence it is not a proper distance measure. For
example, in φ-component model, let μ = (2, 3, 4, 1), μ0 = (4, 3, 1, 2).

d(μ, μ0 ) = λ1 V1 + λ2 V2 + λ3 V3 = 3λ1 + 0λ2 + 1λ3 = 1λ1 + 2λ2 + 1λ3


= d(μ0 , μ).

The symmetric property of distance is not satisfied. Lee and Yu (2012) and Qian and
Yu (2018) introduced new weighted distance measures which can retain the properties
of a distance and also allow different weights for different ranks.

4.4.3. Cyclic Structure Models


Cayley’s distance can also be decomposed into t − 1 independent metrics. Fligner and
Verducci (1986) showed that dC (μ, μ0 ) can be decomposed as


t−1
dC (μ, μ0 ) = Xi (μ, μ0 ), (4.37)
i=1

where Xi (μ, μ0 ) = I(i = max{σ(i), σ(σ(i)), . . .}), and σ(i) = μ(μ−1


0 (i)).
This generalization can be illustrated by an example found in (Fligner and Verducci
(1986)). Suppose there are t lockers, and each locker has one key that can open it. The
μ(μ−1
0 (i))th key will be placed inside the i
th
locker. Without loss of generality, let the
cost of breaking a locker be one. The minimum possible total cost of opening all lockers

86
4. Smooth Goodness of Fit Tests

will then be dC (μ, μ0 ), and it can be decomposed as the sum of costs of opening locker
i, i = 1,2, . . . t − 1, which equals Xi (μ, μ0 ).
If we relax the assumption that the costs of breaking every locker are equal, the total
cost will become

t−1
θi Xi (μ, μ0 ), (4.38)
i=1

where θi is the cost of opening locker i. This “total cost” can be interpreted as a
weighted version of Cayley’s distance. Similar to the extension of Mallow’s φ models to
φ-component models, Fligner and Verducci (1986) developed the cyclic structure models
using the weighted Cayley’s distance. Under this model assumption, the probability of
observing a ranking μ is
t−1
θi Xi (μ,μ0 )
e− i=1
π(μ|θ, μ0 ) = , (4.39)
CC (θ)

where θ = {θi , i = 1, . . . , t − 1} and CC (θ) is the normalizing constant, which equals

"
t−1
{1 + (t − i)e−θi }. (4.40)
i=1

For a ranking data set {μk , k = 1, . . . , n} with a given modal ranking μ0 , the MLEs
θ̂i , i = 1,2, . . . ,t − 1 can be found from the equation

X̄i
θ̂i = log(t − i) − log , (4.41)
1 − X̄i

where n
k=1 Xi (μk , μ0 )
X̄i = . (4.42)
n

4.5. Goodness of Fit Tests for Two-Way Contingency


Tables
We may also consider doubly ordered two-way r × c contingency tables of counts which
we denote by
{Nij } , i = 1, . . . , r, j = 1, . . . , c
We are interested in testing for independence and consequently define the cell prob-
abilities as k k 
1 2

πij (θ) = exp θuv gu (i) hv (j) pi. p,j (4.43)


u=1 v=1

87
4. Smooth Goodness of Fit Tests

where

r 
c 
r 
c
pij = pi. = p.j = 1.
i=1 j=1 i=1 j=1

The {gu (i)} are orthonormal functions on the marginal row probabilities and the {hv (j)}
are orthonormal functions on the marginal column probabilities. The test for indepen-
dence is then a test of
H0 : θ = 0 vs H1 : θ = 0
where θ = (θ11 , . . . , θ1k2 , . . . , θk1 1 , . . . , θk1 k2 ) .
Set
 r  c

V̂uv = Nij ĝu (i) ĥv (j) / n
i j
r c
where i j Nij = n. Here, {ĝu (i)} are the set of polynomials orthogonal to {p̂i. } where
 ( )
p̂i. = j Nij /n. Similarly, ĥv (j) are the set of polynomials orthogonal to {p̂.j } where

p̂,j = i Nij /n. The following theorem, proven in (Rayner et al. (2009b)), shows that
we may obtain the usual test statistic as a consequence of the smooth model (4.43).
 1  k2
Theorem 4.3. The score statistic for testing H0 vs H1 is given by ku=1 2
v=1 V̂uv where
under the null hypothesis, the components V̂uv are asymptotically i.i.d. standard normal
variables.
When k1 = (r − 1) , k2 = (c − 1) , the test statistic is the usual Pearson statistic

  (Nij − np̂i. p̂.j )2


XP2 = → χ2k1 k2
i j
np̂i. p̂.j

Chapter Notes
1. Smooth tests for goodness of fit were introduced by Neyman (1937) when no nui-
sance parameters were involved. Since a probability integral transformation can
transform a distribution to a uniform, the orthogonal polynomials were taken to
be those of Legendre. The tests developed were locally uniformly most powerful,
symmetric, and unbiased. A good introduction along with several references are
given in Rayner et al. (2009b). See also Rayner et al. (2009a) and Rayner et al.
(2009b) for generalizations and extensions of smooth tests of fit including smooth
tests in randomized blocks with ordered alternatives.

2. The polynomials may be taken to be the discrete or Hermite polynomials whose


first two components are
 1
k+1 12
g1 (j) = j− ,
2 k2 − 1

88
4. Smooth Goodness of Fit Tests
 2   5
k+1 k2 − 1 180
g2 (j) = j− − .
2 12 (k 2 − 1) (k 2 − 4)

Higher order components may be obtained using the recurrence equations as de-
scribed in (Rayner et al. (2009b), p. 243). Additional polynomials may be obtained
from the usual three term recursion formulas (Kendall and Stuart, 1979).

3. We note that the smooth testing approach here leads to a study of global properties
of the data. This is to be contrasted with the approach in Chapter 7 whereby we
will consider smooth tests that incorporate specific score functions such as those
of Spearman and Kendall. In those instances, the vector parameter θ places a
weighting on the components of the score function. The approach enables us to
study more precisely local properties.

4. Lancaster (1953) considered a decomposition of the chi-square statistic in connec-


tion with testing for goodness of fit in contingency tables which helps to assess the
individual contributions of the components.

5. A simple application of Neyman’s smooth tests is to the problem of combining


p-values which under the null hypothesis are uniformly distributed (Rayner et al.
(2009a), p. 63).

4.6. Exercises
Exercise 4.1. Prove Theorem 4.2.

Exercise 4.2. Suppose that we are given the smooth binomial distribution given in
Table 4.2. Find the score statistic to test the hypothesis that θ = 0.

Exercise 4.3. Repeat Exercise 4.2 using a sample of size n from the smooth Poisson
distribution given in Table 4.2 and test the hypothesis that θ = 0.

89
5. One-Sample and Two-Sample
Problems
In this chapter we consider several one- and two-sample problems in nonparametric
statistics. Our approach will have a common thread. We begin by embedding the
nonparametric problem into a parametric paradigm. This is then followed by deriving
the score test statistic and finding its asymptotic distribution. The construction of
the parametric paradigm often involves the use of composite likelihood. It will then
be necessary to rely on the use of either linear rank statistics or U -statistics in order
to determine the asymptotic distribution of the test statistic. We shall see that the
parametric paradigm provides new insights into well-known problems. Starting with
the sign test, we show that the parametric paradigm deals easily with the case of ties.
We then proceed with the Wilcoxon signed rank statistic and the Wilcoxon rank sum
statistic for the two-sample problem.

5.1. Sign Test


Suppose that we have a random sample X1 , . . . , Xn from a population having a (not
necessarily continuous) distribution FX (x) with a unique median M (i.e., FX (M ) = 0.5).
We would like to test the hypotheses

H0 : M = M0 versus H1 : M > M0 . (5.1)

Under the parametric paradigm, we first define a score function, sensitive to changes
in the median. Let Y = sgn (X − M0 ) and define the change of measure by

π (y; θ) = exp (θy − K(θ)) g(y), y = −1, 0, 1, (5.2)

where g(y) represents the null probability distribution of Y . Let g(1) = p+ , g(0) = p0
and g(−1) = p− such that
p+ + p0 + p− = 1.

© Springer Nature Switzerland AG 2018 91


M. Alvo, P. L. H. Yu, A Parametric Approach to Nonparametric Statistics,
Springer Series in the Data Sciences, https://doi.org/10.1007/978-3-319-94153-0_5
5. One-Sample and Two-Sample Problems

It is natural to assume that p+ = p− . Here K (θ) satisfies

p+ eθ + p− e−θ + p0 = eK(θ) .

Since p+ + p− + p0 = 1 and p+ = p− , it follows that K (0) = K  (0) = 0. It is easy to see


that testing the hypotheses in (5.1) is the same as testing

H0 : π (1; θ) = π (−1; θ)

versus
H1 : π (1; θ) > π (−1; θ)
or equivalently
H0 : θ = 0 versus H1 : θ = 0
In a sample of size n, let n+ , n− , n0 denote the observed number of cases where
y = 1, −1, 0, respectively. Note that under H0 , the score function is given by

U (θ; X) = n+ − n−

and the Fisher information is I = nK  (0) = n(1− p̂0 ) = n−n0 . The score test statistic is
2 2
[U (θ; X)]2 (n+ − n− )2 4 n+ + n20 − n2 4 n+ − n−n
2
0

Ssign = = = = . (5.3)
I n − n0 n − n0 n − n0

Consequently large values of Ssign lead to rejection of the null hypothesis H0 . We see
then that under H0 ,
L
Ssign −
→ χ21
as n → ∞.
Remark 5.1. We note that the sign test takes into account situations where ties (i.e.,
Xi = M0 ) are possible and the score function leads naturally to the statistic n+ + n20
often suggested without justification in the literature. Note that the last expression of
the score test statistic Ssign in (5.3) seems to recommend the usual treatment of ties,
namely to reduce the sample size by discarding the tied observations.
Remark 5.2. In the case of no ties (i.e., n0 = p0 = 0), the score test statistic Ssign
only depends on n+ which has a binomial distribution with probability p+ , i.e., n+ ∼
Bin(n, p+ ). Under H0 , we have p+ = 0.5 and hence n+ ∼ Bin(n, 0.5). For H1 : M > M0 ,
we have p+ > 0.5 and hence large values of n+ will lead to rejection of H0 . The p-value
of the sign test is then Pr(B ≥ n∗+ ) where B ∼ Bin(n, 0.5) and n∗+ is the observed value
n+ . Similarly, for H1 : M < M0 , the p-value is Pr(B ≤ n∗+ ) and for H1 : M = M0 , the
p-value is 2 × min{Pr(B ≥ n∗+ ), Pr(B ≤ n∗+ )}.

92
5. One-Sample and Two-Sample Problems

Example 5.1. The weekly sales of new mobile phones at a mobile phone shop in a mall
are collected over the past 12 weeks. The number of phones sold is recorded for each of
12 weeks and are given below:

45 32 39 29 64 55 38 212 187 124 320 188

Last year, the median weekly sales was 50 units. Is there sufficient evidence to conclude
that median sales this year are higher than last year? Test the hypothesis at a significance
level α = 0.05.
Solution. Let M be population median weekly sales this year. The hypotheses are

H0 : M = 50 versus H1 : M > 50,

n = 12 and n+ = 7. Set B ∼ Bin(12, 0.5). Then, the p-value for the test is given by

12  
12
p−value = Pr(B ≥ 7) = 0.512 = 0.3872.
i=7
i

Since the p-value is greater than 0.05, we do not have enough evidence to reject H0 at
the 5% level of significance. Thus, there are no grounds upon which to conclude that
the median sales are now higher than 50 units per week.

In R, the function pbinom(x,n,p) will calculate Pr(X ≤ x) where X ∼ Bin(n, p).

Remark 5.3. In the presence of ties, it is easy to show that under H0 , n+ + n20 has mean
n
2
and variance n4 (1 − p0 ) but it no longer has a binomial distribution. When the sample
size n is large, we can apply the normal approximation to its null distribution.

Example 5.2. A food product is advertised to contain 75 mg of sodium per serving,


while preliminary studies indicate that servings may contain more than that amount.
We can formulate this problem as a test for the median M of the amount of sodium per
serving:
H0 : M = 75 versus H1 : M > 75.
Suppose that 40 packages of the company’s food product are examined of which 26
packages were observed to contain sodium amounts per serving exceeding 75 mg. Then
using the normal approximation to the binomial probability, the p-value of sign test is

Pr(B ≥ 26) = Pr(B > 25.5) (continuity correction)


25.5 − 20
= Pr(Z > √ ) (Z ∼ N (0, 1) under H0 )
0.25 × 40
= Pr(Z > 1.7393) = 0.041

93
5. One-Sample and Two-Sample Problems

which is close to the exact p-value 0.040. Since the p-value is less than 0.05, we conclude
that the median amount of sodium per serving is greater than 75 mg at the 5% level of
significance.
In R, the function pnorm(x,μ,σ) will calculate Pr(X ≤ x) where X ∼ N (μ, σ 2 ).

Remark 5.4. Since only the signs of {Xi − M 0 } but not its magnitude are used, the sign
test has the advantage that it can be utilized when only the signs are available. The
test is also robust against outliers, as only the sign of the outlier is of interest no matter
how far it is away from M0 . When either p0 = 0 or n0 = 0, we obtain the usual sign test
(Hájek and Sidak (1967), p. 110).
Remark 5.5 (Paired Comparisons). There are several applications of the sign test for
paired comparisons whereby the null hypothesis is that the distribution of the difference
Z = Y − X is symmetric about 0. As a special case, we can consider the shift model
where Z = (Y − Δ) − X so that the treatment adds a shift of value Δ to the control.
The asymptotic properties of the sign test are discussed in several textbooks. See, for
example, Chapter 14 of van der Vaart (2007) and Chapter 4 of Lehmann (1975).

5.1.1. Confidence Interval for the Median


Definition 5.1. A confidence interval for a parameter φ is called distribution-free if

Pr(L < φ < U ) = 1 − α

is true no matter what the distribution F is where L = f1 (X1 , . . . , Xn ) and U =


f2 (X1 , . . . , Xn ) are statistics.
Suppose the distribution of the data FX (x) is continuous. The procedure for con-
structing a symmetric two-sided distribution-free confidence interval for the median M
with level of confidence 1 − α is given below:

(a) Find cα such that Pr(cα ≤ B ∗ ≤ n − cα ) = 1 − α, with B ∗ ∼ Bin(n, 0.5).

(b) Let X(1) < X(2) < · · · < X(n) be the order statistics.

(c) (X(cα ) , X(n+1−cα ) ) is a 100(1 − α)% confidence interval for M as it satisfies


Pr(X(cα ) < m < X(n+1−cα ) ) = 1 − α.

We may show
Pr(X(cα ) < M < X(n+1−cα ) ) = 1 − α.
By definition, cα is chosen such that

Pr(cα ≤ B ∗ ≤ n − cα ) = 1 − α.

94
5. One-Sample and Two-Sample Problems

Now

B ∗ ≥ cα ⇔ number of (Xi > M ) ≥ cα


⇔ M < at least cα of Xi ⇔ M < X(n+1−cα ) < · · · < X(n)
⇔ M < X(n+1−cα ) .

Similarly,

B ∗ ≤ n − cα ⇔ number of (Xi > M ) ≤ n − cα


⇔ M < at most n − cα of Xi
⇔ M > at least cα of Xi ⇔ X(1) < · · · < X(cα ) < M
⇔ M > X(cα ) .

Therefore, the confidence interval is X(cα ) < M < X(n+1−cα ) .

Example 5.3. Suppose we have a set of n = 7 observations

{2, −9, 11, 40, 10, 18, 0}.

Then, if B ∗ ∼ Bin(7, 12 ),

Pr{B ∗ ≥ 6} = 0.0625 ⇔ Pr(2 ≤ B ∗ ≤ 5) = 1 − 2 × 0.0625 = 0.875.

Thus cα = 2 and the 87.5 confidence interval for M is

(X(2) , X(7+1−2) ) = (X(2) , X(6) ) = (0, 18).

Remark 5.6. For a large sample, we may approximate B ∗ ∼ Bin(n, 12 ) by B ∗ ∼ N ( n2 , n4 ).


Hence cα can be approximated by

n n  12
cα ≈ − z α2
2 4
since we have
 
B∗ − n
Pr −z α2 ≤ 1
2
≤ z α2 =1−α
( n4 ) 2
 n  12 n  12 
n n∗
⇔ Pr − z α2 ≤ B ≤ + z α2 = 1 − α.
2 4 2 4

Remark 5.7. A similar procedure can be formulated to determine a confidence interval


for the (100p)th percentile or cdf of a continuous distribution.

95
5. One-Sample and Two-Sample Problems

5.1.2. Power Comparison of Parametric and Nonparametric


Tests
5.1.2.1. Parametric Test for Large Samples: CLT Test
Let X1 , X2 , . . . , Xn denote a random sample from a distribution with mean μ and vari-
d
ance σ 2 . The central limit theorem (CLT) states that for large samples, X̄ → N (μ, σ 2 ),
regardless of the form of the underlying distribution.
For a sample from a symmetric distribution, the population mean and the population
median are the same, and thus any test for the mean is also a test for the median. The
hypotheses
H0 : M = M0 vs H1 : M > M0
for the sign test are therefore equivalent to:

H0 : μ = μ0 vs H1 : μ > μ0

where μ0 = M0 is the hypothesized value of the population mean of X. When the


standard deviation σ is known, we use the test statistic:

X̄ − μ0
Z= √ .
σ/ n

The null hypothesis H0 is rejected in favor of H1 at the significance level α if Z > zα ,


the upper 100α% point of the standard normal distribution.
Remark 5.8. The CLT test is not applicable for populations that come from a distribution
with an infinite variance, such as the Cauchy distribution.

5.1.2.2. Comparison of Sign Test and CLT Test


To choose between the sign test and the CLT test, we pay attention to two statistical
issues: the Type I error which consists of rejecting H0 when H0 is true and the power
which is defined to be the probability of rejecting H0 .

For our discussion, we shall only consider the large sample case with σ known. The
probability of committing a Type I error should be at least close to the significance level
α. Though bias may occur when using an approximation, the normal approximation for
both CLT and sign tests is considered to be quite good for large samples. Therefore, the
stated probability of committing a Type I error will be essentially correct.
To compare the power of the two tests, we must consider the population distribution
under H1 . Referring to Example 5.2, let us assume the true median of sodium content

96
5. One-Sample and Two-Sample Problems

is 75.8 mg with σ = 2.5 and the sample is selected from a normal population. Then

X̄ − 75
power of CLT test = Pr( √ ≥ 1.645 | μ = 75.8)
2.5/ 40
X̄ − μ μ − 75
= Pr( √ ≥ 1.645 − √ | μ = 75.8)
2.5/ 40 2.5/ 40
75.8 − 75
= 1 − Φ(1.645 − √ ) = 0.65
2.5/ 40

Consider the sign test, if μ = 75.8, we have p ≡ Pr(X > 75) = 0.626.

B − 20
power of sign test = Pr( √ ≥ 1.645 | B ∼ Bin(40, 0.626))
0.25 × 40

= Pr( √B−40p ≥ 1.645 p(1−p)
0.25
− √40p−20 | B ∼ Bin(40, 0.626))
40p(1−p) 40p(1−p)
5
0.25 40p − 20
= 1 − Φ(1.645 −- ) = 0.48
p(1 − p) 40p(1 − p)

In this example, the CLT test is preferred since the power is greater.

Remark 5.9. When samples are taken from normal populations with a known variance,
the CLT test has the greatest power among all tests (uniformly most powerful test). It’s
the test to use when sampling from a normal population. But for nonnormal populations,
this is not the case. The sign test will have higher power than the CLT test for heavy-
tailed distributions, including the Cauchy or Laplace distributions. For example, if the
true distribution is Laplace, the power of the sign test is 0.76.

5.2. Wilcoxon Signed Rank Test


As we saw in Section 5.1, the sign test can be derived from the theory of U -statistics by
defining the kernel function h(Zi ) on the sign of the difference Zi = Xi − M,

h(Zi ) = sgn(Zi ), i = 1, 2, . . . , n.

The Wilcoxon signed ranked test which takes into account both the rank and the sign is
more powerful. It is used to test the hypothesis that the distribution of X is symmetric
about M . Consider the two-variable kernel

Yij = h(Zi , Zj ) = I(Zi + Zj > 0), i ≤ j, i, j = 1, 2, . . . , n

97
5. One-Sample and Two-Sample Problems

where (Zi + Zj ) are the Walsh sums and define the smooth model

π(yij ; θ) = exp [θyij − K(θ)] gY (yij ), i ≤ j, i, j = 1, 2, . . . , n

where gY is the density of Y assumed to be symmetric around 0 under the null hypothesis,

H0 : θ = 0.

It is easy to see that


1
K (0) = 0, K  (0) = .
2
The log of the composite likelihood function is proportional to
 

 n+1
(θ) ∼ θ I(zi + zj > 0) − K(θ) .
i≤j
2

The score vector is 


Wn+ = I (zi + zj > 0) . (5.4)
i≤j

The usual Wilcoxon signed-rank statistic, SR+ , is defined as


n
SR+ = Ri+ I(Zi > 0),
i=1

where Ri+ be the rank of |Zi | among {|Zj | , j = 1, . . . , n}. The following lemma shows
an equivalent form between Wn+ and SR+ .

Lemma 5.1. Wn+ = SR+ .

Proof. Suppose without loss of generality that the Zi s are ordered in absolute value:

|Z1 | < . . . < |Zn |.

Fix j and consider all the pairs (Zi , Zj ) , i ≤ j. The sum Zi + Zj > 0 if and only if
Zj > 0. The number of indices for which this is true is equal to j, the rank of |Zj |.

Though itself not a U -statistic, Wn+ is, in fact, the sum of two U -statistics:

Wn+ = I(Zi + Zj > 0)
i≤j
 
= I(Zi > 0) + I(Zi + Zj > 0) (5.5)
i i<j
= nU1n + (n2 ) U2n

98
5. One-Sample and Two-Sample Problems

with respective kernels I(Zi > 0) and I(Zi + Zj > 0). Under the null hypothesis, we
have that
   
 n  n (n − 1)
E I (Zi > 0) = , E I (Zi + Zj > 0) =
i
2 i<j
4

and hence
n (n + 1)
E Wn+ = .
4
As well,

V ar(Wn+ ) = V ar (nU1n ) + V ar ((n2 ) U2n ) + 2Cov (nU1n , (n2 ) U2n )


n n (n2 − 1) n (n − 1)
= + +
4 12 8
n (n + 1) (2n + 1)
=
24
It can be shown that for large n, the asymptotic distribution of Wn+ is determined by
the distribution of the second term. The test rejects for large values of Wn+ . Specifically,
as n → ∞
W + − n (n + 1) /4 L
- n −
→ N (0, 1) .
n (n + 1) (2n + 1) /24
An adjustment for the variance can be made in the case of ties. Specifically, we
replace the variance above by
 3 
tl − tl
n (n + 1) (2n + 1) /24 − ,
48
where {tl } represent the number of absolute differences tied for a particular nonzero
rank. See Lehmann (1975).
Remark 5.10. The extension to paired-sample problems is straightforward. Given a ran-
dom sample of paired observations (Xi , Yi ), i = 1, . . . , n, the basic idea of the Wilcoxon
signed rank test for paired samples is to apply the test to the differences Zi = Xi −Yi , i =
1, . . . , n, where the null hypothesis is that the population median of Z is 0. Similarly,
we can also apply the sign test to paired samples in the same manner.

Example 5.4. A speed-typing course was given to 18 clerks in the hope of training them
to be more efficient typists. Each clerk was tested on her typing speed (in w.p.m.) before
and after the course. The results are given in the table below. With a 1% significance
level use Wilcoxon signed rank test to see whether the course was effective.
After 53 33 54 61 55 57 40 59 58 53 59 62 51 43 64 68 48 60
Before 42 35 48 52 60 43 36 63 51 45 56 50 41 38 60 52 48 57

99
5. One-Sample and Two-Sample Problems

Solution. We are proposing to test these hypotheses:


H0 : There is no difference between a clerk’s typing speed before and after the course;
H1 : A clerk’s typing speed generally increases after completing the course.
If we do not assume any statistical distribution (say, a normal distribution) for a
clerk’s typing speed, a nonparametric Wilcoxon signed rank test is applicable. The test
proceeds as follows:
Step 1. The speed differences (Z = After - Before) are first obtained. We are testing

H0 : M = 0 against H1 : M > 0,

where M is the median of Z.


Step 2. Their signs being temporarily suspended, the magnitudes of the differences are
ranked, starting from the smallest. Tied cases are given by the “average” of the ranks
that would have been given if no ties were present.
Zi 11 −2 6 9 −5 14 4 −4 7 8 3 12 10 5 4 16 −1 3
+
Ri 15 2 10 13 8.5 17 6 6 11 12 3.5 16 14 8.5 6 18 1 3.5

Step 3. The signs of the original differences are then restored to the ranks, and the sum
of the positive ranks, Wn+ = 153.5, is the value of the test statistic.
Step 4. Since H1 is a one-sided hypothesis, a one-tailed test is appropriate. As a bigger
SR+ value produces a stronger support for H1 , one can refer to the table of critical
values (see Table 20 of Lindley and Scott (1995)) with n = 18 and α = 0.01 for decision
making. This critical value is 139. We therefore reject H0 at 1% significance level and
conclude that the course was effective.
The Wilcoxon signed rank test can be implemented using R function wilcox.test. For
example, wilcox.test(x, mu=0, alternative = "greater") produces an exact p-
value of the test with alternative being the median greater than 0 if the sample contains
less than 50 observations and has no ties. Otherwise, a normal approximation is used.

Remark 5.11. One can also apply the sign test in this example. For n = 18 the data
exhibit 14 positive signs, which is insignificant at 1% level (critical value being 15). It
is interesting to see that the sign test uses less information derived from data than the
Wilcoxon signed rank test and the strength of evidence supporting H1 is also lower.

5.3. Two-Sample Problems


Suppose that we are presented with two independent random samples, X1 , . . . , Xn and
Y1 , . . . , Ym with continuous cumulative distributions F and G, respectively. Such would
be the case if one is interested in assessing the effect of a treatment on a group of patients.

100
5. One-Sample and Two-Sample Problems

One sample would serve as the control group and the other as the treatment group. We
would like to test the null hypothesis

H0 : F = G

against the alternative

H1 : F = G.
It is typically assumed that in the shift model, the two distributions only differ by a
location parameter Δ:
F (x) = G (x − Δ) .
That is, X − Δ and Y have the same distribution. If the treatment effect depends on
x, then a more general alternative might be

H1 : F (x) > G (x) or H1 : Δ < 0

which indicates that small values are more likely under F than under G. In that case,
we say that Y is stochastically larger than X. We consider in the next section a test for
this more general alternative.

Example 5.5. Consider the usual two-sample t-test involving independent samples
whereby X1 , X2 , . . . , Xn are iid N (μx , σ 2 ) and Y1 , Y2 , . . . , Ym are iid N (μy , σ 2 ). Then
F (x) = Φ( x−μ
σ
x
) and G(y) = Φ( y−μ σ
y
), where Φ(·) is the cdf of N (0, 1). Then it is easy
to see that Δ = μx − μy .

5.3.1. Permutation Test


We include for completeness a brief discussion of permutation tests. Permutation or
randomization tests as they are sometimes called provide an effective though computa-
tionally intensive method for determining the sampling distribution of a test statistic
under the null hypothesis of “no association.” The basic idea which originated with R.A.
Fisher (1935, Chapter 3) consists of permuting the labels of the observed data and com-
puting the value of the test statistic. This operation is repeated a large number of times
thereby creating a histogram of values that approximates the null distribution of the
test statistic. A p-value may then be calculated by counting the number of cases which
yield values at least as extreme as the one obtained from the observed data. Since the
permutation test is conditional on the observed data, it is distribution-free. Permutation
tests are particularly useful when the data sets are small and the underlying distribution
of the test statistic is uncertain. Further results on permutation tests may be found in
the book written by Mielke and Berry (2001).
Example 5.6. Let X1 , . . . , Xn be a random sample from a continuous cdf F (x) and let
Y1 , . . . , Ym be an independent random sample from F (x + Δ), where Δ is finite. Let
N = n + m. We would like to test the hypotheses

H0 : Δ = 0

against
H1 : Δ > 0.
Suppose that the test statistic to be used is given by the difference in the sample means

D = X̄ − Ȳ .

Let $D_{obs}$ be the observed value of D. Under the null hypothesis, the combined sample of N observations is assumed to come from the same cdf F. Ignoring the labels which identify the distribution from which the observations came, there would be $\binom{N}{n}$ possible samples of n observations assigned to the first population and m observations assigned to the second. This is precisely the number of possible permutations of the labels, and each is equally likely. For each of these permutations, we may compute a value of the statistic D, thereby creating a reference distribution (which also includes the observed $D_{obs}$), and then count how many of these values exceed $D_{obs}$ in order to calculate the p-value.

Example 5.7. Fungal infection was believed to have a certain effect on the eating behavior of rodents. In an experiment, infected apples were offered to a group of eight randomly selected rodents, and sterile apples were offered to a group of four. The amounts consumed (grams of apple per kilogram of body weight) are listed in the table below. Test whether fungal infection significantly reduces the amount of apples consumed by rodents.

Experimental Group (Xi ) 11, 33, 48, 34, 112, 369, 64, 44
Control Group (Yi ) 177, 80, 141, 132

In this example, no assumption is made about the distribution of the data. Given that the assignment of rodents to either group is random, if there is no difference (H0) between the two groups, all partitions of the 12 scores into two groups of sizes 8 and 4 are equally likely to occur. By permuting the 12 scores, we obtain a total of $\binom{12}{4} = 495$ partitions as follows:
        Experimental Group                Control Group       Difference Between Means
  1     11 33 48 34 112 64 44 80         369 177 141 132            −151.500
  2     11 33 48 34 64 44 80 132         112 369 177 141            −144.000
  3     11 33 48 34 64 44 80 141         112 369 177 132            −140.625
  4     11 33 48 34 112 64 44 132        369 177 80 141             −132.000
  ...
  135   11 33 112 64 44 177 141 132      48 34 369 80                −43.500
  136*  11 33 48 34 112 369 64 44        177 80 141 132              −43.125
  137   11 34 112 64 44 177 141 132      33 48 369 80                −43.125
  138   11 33 48 112 64 177 141 132      34 369 44 80                −42.000
  ...
  495   48 112 369 64 177 80 141 132     11 33 34 44                 109.875

* refers to the observed data.

Using the difference between the sample means, i.e., $D = \bar{X} - \bar{Y}$, as our test statistic, we obtain the permutation distribution of the difference shown below.

[Figure: histogram of the permutation distribution of the difference of means $\bar{X} - \bar{Y}$, with density up to about 0.010 and differences ranging from −150 to 100.]

It can be seen from the figure that under H0 small differences tend to occur more frequently, which is reasonable since the difference will tend to fall around 0 if the two groups do not differ.
In a lower-tailed test, calculating the p-value of the observed difference is equivalent to finding the probability of observing a difference of means of −43.125 or smaller under the assumption that the two groups do not differ. In this example, the p-value is 137/495 = 0.2768.
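A minimal R sketch of this exact permutation calculation, using the data of Example 5.7:

    # Exact two-sample permutation test for Example 5.7.
    x <- c(11, 33, 48, 34, 112, 369, 64, 44)   # experimental group
    y <- c(177, 80, 141, 132)                  # control group
    z <- c(x, y)
    d.obs <- mean(x) - mean(y)                 # observed difference, -43.125

    # Enumerate all choose(12, 4) = 495 choices of the control group.
    ctrl <- combn(length(z), 4)
    diffs <- apply(ctrl, 2, function(idx) mean(z[-idx]) - mean(z[idx]))

    # Lower-tailed p-value: proportion of differences <= the observed one.
    mean(diffs <= d.obs)                       # 137/495 = 0.2768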

Remark 5.12. There are other possible choices of the test statistic in the permutation test:

• Difference of means, $D = \bar{X} - \bar{Y}$.

• Sum of the observations for one group, $T_1 = \sum_{i=1}^{n} X_i$ (or $T_2 = \sum_{j=1}^{m} Y_j$). Since $T = T_1 + T_2$ is fixed given the observed data, we have $D = \frac{T_1}{n} - \frac{T_2}{m} = T_1\left(\frac{1}{n} + \frac{1}{m}\right) - \frac{T}{m}$, and thus D and $T_1$ are equivalent test statistics.

• The two independent sample t-test statistic.

• Difference of medians, $X_{0.5} - Y_{0.5}$. The median has the benefit of being robust to a few extreme observations (called outliers).

• Difference of trimmed means. In a trimmed mean, we remove extreme observations before taking the average. For example, if we have ten observations given by 2, 8, 9, 10, 11, 11, 12, 13, 14, 200, then to get a 20% trimmed mean we delete 2 and 200 and take the average of the rest, which is 11. The trimmed mean is useful because it is less sensitive to outliers than the mean.

Generally, deciding which statistic to use requires some advance knowledge of the population. The difference of the means is most commonly used, especially when the data come from an approximately normal distribution. But if the population has an asymmetric distribution, the median may be a more desirable indicator of the center of the data. The difference of trimmed means is used when the distribution is symmetric but likely to have outliers.

Here we summarize the general steps for a two-sample permutation test, assuming that a large test statistic T tends to support Δ > 0:

1. Based on the original data, compute the observed test statistic, $T_{obs}$ (e.g., the difference between the two sample means).

2. Permute the n + m observations from the two treatments so that there are n observations for population 1 and m observations for population 2. Obtain all $\binom{n+m}{n}$ possible permutations. Compute the value of the test statistic T for each permutation.

3. Calculate the p-value based on the test statistic $T_{obs}$:
$$P_{\text{upper-tail}} = \frac{\#\{T \ge T_{obs}\}}{\binom{m+n}{n}}, \qquad P_{\text{lower-tail}} = \frac{\#\{T \le T_{obs}\}}{\binom{m+n}{n}}, \qquad P_{\text{two-tail}} = \frac{\#\{|T| \ge |T_{obs}|\}}{\binom{m+n}{n}}.$$

4. Declare the test to be significant if the p-value is less than or equal to the desired
significance level.

Remark 5.13. The permutation test can become rather tedious as the sample sizes m and n increase. For instance, $\binom{20}{10} = 184{,}756$ is already quite large. Fortunately, there is a simple way to obtain an approximate p-value in such cases. Rather than using all the possible permutations, we take a random sample of, say, 1,000 out of the $\binom{m+n}{m}$ permutations and find the approximate p-value using the distribution formed by the 1,000 statistics, in the same manner as the exact permutation test.
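A minimal sketch of this Monte Carlo approximation, reusing the data of Example 5.7:

    # Approximate permutation p-value from B random relabellings.
    set.seed(1)
    x <- c(11, 33, 48, 34, 112, 369, 64, 44)
    y <- c(177, 80, 141, 132)
    z <- c(x, y); n <- length(x); B <- 1000
    d.obs <- mean(x) - mean(y)
    diffs <- replicate(B, {
      idx <- sample(length(z), n)        # random labels for population 1
      mean(z[idx]) - mean(z[-idx])
    })
    mean(diffs <= d.obs)                 # approximates 137/495 = 0.2768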

5.3.2. Mann-Whitney-Wilcoxon Rank-Sum Test


Let X and Y be two independent random variables with cumulative continuous distri-
butions F and G, respectively. Suppose that we are interested in testing the hypotheses,

H0 : F (x) = G (x) vs H1 : F (x) ≥ G (x) (5.6)

for all x and strict inequality for some x.


Example 5.8. If X ∼ N (μX , σ 2 ) and Y ∼ N (μY , σ 2 ), then the hypotheses become

H0 : μY = μX vs H1 : μY > μX

so that the alternative points to a shift in the mean of the distribution of Y.


We may now embed the nonparametric problem given in (5.6) into a smooth model. Define the kernel function h(x, y) = I(x < y) and the smooth model as
$$\pi(h(x, y); \theta) = \exp\left[\theta I(x < y) - K(\theta)\right] g(h(x, y)), \qquad (5.7)$$
where $g(h(x, y)) = \frac{1}{2}$, $x \ne y$, is the null density of h(X, Y) and K(θ) satisfies
$$e^{K(\theta)} = \frac{e^\theta + 1}{2}.$$
It follows that K(0) = 0. When θ = 0, the model specified in (5.7) indicates that X
and Y are independent and identically distributed with distribution F . Consequently,
the hypotheses in (5.6) can be expressed in terms of θ as

H0 : θ = 0 vs H1 : θ > 0.

For random samples of sizes n and m from F and G, respectively, the log of the composite likelihood function becomes proportional to
$$l(\theta; X, Y) \sim \theta \sum_{j=1}^{m} \sum_{i=1}^{n} I(X_i < Y_j) - nmK(\theta).$$
The score test statistic evaluated under H0 is then given by the U-statistic
$$U(X, Y) = \sum_{j=1}^{m} \sum_{i=1}^{n} I(X_i < Y_j) = \#\{(X_i, Y_j) : X_i < Y_j\} \qquad (5.8)$$

which rejects H0 for large values. This is called the Mann-Whitney statistic and it is the
counting form of the statistic. It is this version which is most often used for theoretical
developments as will be seen in Chapter 8 when we consider notions of efficiency.

Lemma 5.2. Let $Y_1, \ldots, Y_m$ and $X_1, \ldots, X_n$ be two independent random samples. Let $R(Y_j)$ be the rank of $Y_j$ in the combined sample and let $S(Y_j)$ be the rank of $Y_j$ among the $\{Y_j\}$. Denote the combined sample by $\{Z_i\}$. Then
$$R(Y_j) = S(Y_j) + \sum_{i=1}^{n} I(Y_j > X_i).$$

Proof. It is easy to see that
$$\begin{aligned}
R(Y_j) &= \sum_{i=1}^{n+m} I(Y_j > Z_i) + 1 \\
&= \sum_{i=1}^{n+m} I(Y_j > Z_i)\left[I(Z_i \in \{Y_k\}) + I(Z_i \in \{X_k\})\right] + 1 \\
&= \sum_{i=1}^{m} I(Y_j > Y_i) + 1 + \sum_{i=1}^{n} I(Y_j > X_i) \\
&= S(Y_j) + \sum_{i=1}^{n} I(Y_j > X_i).
\end{aligned}$$
It can be seen from Lemma 5.2 that the score statistic is equal to
$$U(X, Y) = \sum_{j=1}^{m} R(Y_j) - \sum_{j=1}^{m} S(Y_j) = W - \frac{m(m+1)}{2}, \qquad (5.9)$$
where the sum
$$W = \sum_{j=1}^{m} R(Y_j) \qquad (5.10)$$
is known as the Wilcoxon rank-sum statistic; it is equivalent to the Mann-Whitney statistic U. Both tests reject the null hypothesis for large values. Properties of the Wilcoxon test under the alternative can be derived from the properties of U-statistics.
Theorem 5.1. Let $Y_1, \ldots, Y_m$ and $X_1, \ldots, X_n$ be two independent random samples, with the X's having distribution F and the Y's having distribution G. Then for the statistic defined in (5.8):

(a) $E(U(X, Y)) = mnq_1$ and
$$Var(U(X, Y)) = mnq_1(1 - q_1) + mn(n-1)(q_2 - q_1^2) + mn(m-1)(q_3 - q_1^2),$$
where
$$q_1 = P(X_1 < Y_1), \quad q_2 = P(X_1 < Y_1, X_1 < Y_2), \quad q_3 = P(X_1 < Y_1, X_2 < Y_1).$$

(b) If $\min(n, m) \to \infty$ with $m/(n+m) \to \lambda > 0$, then
$$\frac{U(X, Y) - E(U(X, Y))}{\sqrt{Var(U(X, Y))}} \xrightarrow{L} N(0, 1).$$

(c) Under the null hypothesis, whereby F(x) = G(x) for all x,
$$q_1 = \frac{1}{2}, \qquad q_2 = q_3 = \frac{1}{3},$$
and
$$\frac{U(X, Y) - \frac{mn}{2}}{\sqrt{\frac{mn(m+n+1)}{12}}} \xrightarrow{L} N(0, 1).$$

Proof. The proof is found in Lehmann (1975, pp. 335 and 364) and is a direct application of Example 3.5.
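The null moments in part (c) are easy to verify numerically; the following is a sketch of such a Monte Carlo check in R:

    # Under H0, U should have mean mn/2 and variance mn(m+n+1)/12.
    set.seed(1)
    n <- 30; m <- 40
    U <- replicate(5000, {
      x <- rnorm(n); y <- rnorm(m)
      sum(outer(x, y, "<"))              # U = #{(i, j) : X_i < Y_j}
    })
    c(mean(U), m * n / 2)                # empirical vs. theoretical mean
    c(var(U), m * n * (m + n + 1) / 12)  # empirical vs. theoretical variance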
Remark 5.14. The Mann-Whitney test is also based on the permutation principle with U as the test statistic, and its critical values can be tabulated accordingly; see, for example, Table 21 of Lindley and Scott (1995). To make inferences, we may either compute the p-value with the permutation method or compare the observed Mann-Whitney test statistic $U_{obs}$ with the corresponding critical value in the table.

The Mann-Whitney test can be implemented using the R function wilcox.test. For wilcox.test(x, y, paired=FALSE), the test statistic is defined as "the number of pairs (Xi, Yj) for which Yj ≤ Xi." Therefore, our Mann-Whitney U statistic can be obtained using wilcox.test(y, x, paired=FALSE). By default, an exact p-value is computed if the samples contain fewer than 50 finite values and there are no ties. Otherwise, a normal approximation is used.

Remark 5.15. Under H0, the m ranks associated with the Y-sample can be viewed as randomly selected from the finite population of N = m + n ranks. From the theory of sampling from a finite population, the expected value and variance of W under H0 are given by
$$E(W) = m\mu, \qquad Var(W) = \frac{mn\sigma^2}{m+n-1},$$
where
$$\mu = \frac{\sum_{i=1}^{N} i}{N} = \frac{N+1}{2},$$
$$\sigma^2 = \frac{\sum_{i=1}^{N}(i - \mu)^2}{N} = \frac{1^2 + 2^2 + \cdots + N^2}{N} - \left(\frac{N+1}{2}\right)^2 = \frac{(N-1)(N+1)}{12}.$$
Hence, under H0,
$$E(W) = \frac{m(m+n+1)}{2}, \qquad Var(W) = \frac{mn(m+n+1)}{12}.$$
Note that from (5.9), $U(X, Y) = W - \frac{m(m+1)}{2}$. It can be shown that under H0,
$$E(U(X, Y)) = E(W) - \frac{m(m+1)}{2} = \frac{mn}{2}$$
and
$$Var(U(X, Y)) = Var(W) = \frac{mn(m+n+1)}{12}.$$
Remark 5.16 (Wilcoxon Rank-Sum/Mann-Whitney Test Adjusted for Ties). The case of ties can be dealt with by counting each tied pair as $\frac{1}{2}$ in the Mann-Whitney statistic U(X, Y), or by assigning the average rank (also called the mid-rank) to the tied observations in the Wilcoxon rank-sum statistic W. The mean of U(X, Y) or W does not change, but the variance of U(X, Y) or W under H0 should be adjusted downwards:
$$Var(U(X, Y)) = Var(W) = \frac{mn(m+n+1)}{12} - \frac{mn \sum_{i=1}^{k}(d_i^3 - d_i)}{12(m+n)(m+n-1)}, \qquad (5.11)$$
where k is the number of tied groups and $d_i$ is the number of tied observations in the $i$th tied group, $i = 1, 2, \ldots, k$. See Lehmann (1975, p. 355).

Example 5.9. A statistics course has two tutorial classes conducted by two tutors,
Maya and Alyssa. Students of this course were given a mid-term test and some students
were randomly drawn from each of two tutorial classes and their test scores are shown
below:
Maya’s class (X) 82 74 87 86 75
Alyssa’s class (Y ) 88 77 91 88 94 93 83 94

We are testing the null hypothesis that there is no difference in statistics ability, as
measured by this test, in the two classes.

Solution. n = 5 and m = 8

Combined Data 74 75 77 82 83 86 87 88 88 91 93 94 94
Ranks 1 2 3 4 5 6 7 8.5 8.5 10 11 12.5 12.5
Data from X 82 74 87 86 75
Ranks 4 1 7 6 2
Data from Y 88 77 91 88 94 93 83 94
Ranks 8.5 3 10 8.5 12.5 11 5 12.5

Then W = 8.5 + 3 + 10 + · · · + 12.5 = 71. Using R, we obtain U(X, Y) = the number of pairs (Xi, Yj) for which Xi < Yj = 35, with a p-value of 0.0333, computed using the normal approximation in R since there are ties in the data.
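A sketch of the corresponding R call (recall that wilcox.test(y, x) counts the pairs with $X_i < Y_j$):

    x <- c(82, 74, 87, 86, 75)                # Maya's class
    y <- c(88, 77, 91, 88, 94, 93, 83, 94)    # Alyssa's class
    wilcox.test(y, x, paired = FALSE)
    # W = 35, p-value = 0.0333 (normal approximation because of ties)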
Given the above 13 adjusted ranks, the population mean and variance of these thirteen numbers are
$$\mu = \frac{1 + 2 + \cdots + 12.5 + 12.5}{13} = \frac{1 + 2 + 3 + \cdots + 13}{13} = 7,$$
$$\sigma^2 = \frac{1^2 + 2^2 + \cdots + 12.5^2 + 12.5^2}{13} - 7^2 = 13.92.$$
The Wilcoxon rank-sum test statistic with adjusted ranks has, under H0,
$$E(W) = m\mu = 8(7) = 56,$$
$$Var(W) = \frac{mn\sigma^2}{m+n-1} = \frac{8(5)(13.92)}{12} = 46.41.$$
Although one may apply the sampling formulas directly to the adjusted ranks for tied data, we can also use the explicit formula (5.11) for Var(W) under H0. Note that there are 2 tied groups, {8.5, 8.5} and {12.5, 12.5}, and hence k = 2, $d_1 = d_2 = 2$. Therefore,
$$Var(W) = \frac{mn(m+n+1)}{12} - \frac{mn \sum_{i=1}^{k}(d_i^3 - d_i)}{12(m+n)(m+n-1)} = \frac{8(5)(8+5+1)}{12} - \frac{8(5) \cdot 2(2^3 - 2)}{12(13)(12)} = 46.67 - 0.26 = 46.41.$$

Using a normal approximation with continuity correction, letting W ∼ N(56, 46.41), the p-value of the test is
$$2P(W \ge 71) = 2P\left(Z > \frac{70.5 - 56}{\sqrt{46.41}}\right) = 0.0333, \qquad Z \sim N(0, 1).$$

5.3.3. Confidence Interval and Hodges-Lehmann Estimate for the Location Parameter Δ

Let $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ be two independent random samples drawn from distributions F(x) and G(y), respectively, where F(x) and G(y) differ only by a location parameter Δ, i.e., F(x) = G(x − Δ), or equivalently, X and Y + Δ have the same distribution. We would like to construct a confidence interval for Δ.

First arrange all mn pairwise differences of the form $X_i - Y_j$ from the smallest to the largest. The median of these mn pairwise differences is called the Hodges-Lehmann estimate of Δ. It is a robust, nonparametric estimator of the location difference Δ between X and Y, and is usually computed in conjunction with the confidence interval based on the pairwise differences.
Let $D_{(i)}$ denote the $i$th smallest pairwise difference. To obtain the confidence interval for Δ, we look for integers a, b such that
$$\Pr(D_{(a)} < \Delta \le D_{(b)}) = 1 - \alpha.$$
The inequality holds if and only if at least a and at most b − 1 of the pairs $(X_i, Y_j)$ satisfy $X_i - Y_j < \Delta$ (or $X_i < Y_j + \Delta$). Since $X_i$ and $Y_j' = Y_j + \Delta$ have the same distribution, we have
$$\Pr(D_{(a)} < \Delta \le D_{(b)}) = \Pr(a \le U \le b - 1) = 1 - \alpha.$$
The values of a and b can be obtained from the null distribution of U(X, Y):
$$a = l_{\alpha/2} + 1, \qquad b = u_{\alpha/2},$$
where $l_{\alpha/2}$ and $u_{\alpha/2}$ are the lower-α/2 and upper-α/2 percentile points of the U(X, Y) distribution.

Example 5.10. Referring to Example 5.9, we have n = 5, m = 8 and the Hodges-Lehmann estimate $\hat{\Delta} = -7.5$. From Table 21 of Lindley and Scott (1995), the lower-2.5th and upper-2.5th percentiles of U are $l_{0.025} = 6$ and $u_{0.025} = mn - 6 = 34$. So a = 7 and b = 34. Thus, a 95% confidence interval for Δ is (−17, −1].
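In R, the Hodges-Lehmann estimate and an associated confidence interval are available through wilcox.test with conf.int = TRUE; a sketch on the data of Example 5.9:

    x <- c(82, 74, 87, 86, 75)
    y <- c(88, 77, 91, 88, 94, 93, 83, 94)
    median(outer(x, y, "-"))    # Hodges-Lehmann estimate: -7.5
    wilcox.test(x, y, conf.int = TRUE, conf.level = 0.95)
    # reports the Hodges-Lehmann estimate of the shift together with a
    # 95% confidence interval close to the interval (-17, -1] above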

Remark 5.17. To find a 100(1 − α)% confidence interval for Δ for large samples, we can use a normal approximation with continuity correction. Let Z ∼ N(0, 1). Then
$$\Pr(a \le U(X, Y) \le b - 1) = P\left(\frac{a - 0.5 - E(U(X, Y))}{\sqrt{Var(U(X, Y))}} < Z < \frac{b - 0.5 - E(U(X, Y))}{\sqrt{Var(U(X, Y))}}\right) = 1 - \alpha.$$
For α = 0.05, Pr(−1.96 < Z < 1.96) = 0.95 and we can obtain a and b:
$$a = 0.5 + E(U) - 1.96\sqrt{Var(U)}, \qquad b = 0.5 + E(U) + 1.96\sqrt{Var(U)}.$$

Example 5.11. Referring to Example 5.10, we have $E(U(X, Y)) = \frac{mn}{2} = 20$ and Var(U) = 46.41. Then $a = \lfloor 7.15 \rfloor = 7$ and $b = \lceil 33.85 \rceil = 34$, so that the confidence coverage is at least 95%.

5.3.4. Test for Equality of Scale Parameters


The tests we have discussed so far are particularly designed to detect differences between treatments when the observations from one treatment tend to be larger than those from the other. In some situations, however, the variability of the observations from the two treatments is of interest.

Suppose two machines for bottling Coca-Cola are designed to fill the cans with 330 ml of the soft drink. The observed amounts of drink in the cans from the two machines are expected to be centered around 330 ml, but their variability may not be the same (see Figure 5.1).

[Figure: two density curves centered at 330 ml over the range 300–360 ml, one with less variability and one with more variability.]

Figure 5.1.: Distributions with different scales

For two independent random samples, $X_1, \ldots, X_m$ from treatment 1 and $Y_1, \ldots, Y_n$ from treatment 2, assume
$$X_i = \mu + \sigma_1 \varepsilon_{i1}, \quad i = 1, \ldots, m,$$
$$Y_j = \mu + \sigma_2 \varepsilon_{j2}, \quad j = 1, \ldots, n,$$
where all the ε's are i.i.d. random variables with a median of 0. Note that both the X's and the Y's share the same location parameter μ. Here, we want to test $H_0: \sigma_1 = \sigma_2$. A nonparametric test that makes use of the Wilcoxon rank-sum test is the Siegel-Tukey test (Siegel and Tukey, 1960). The steps for carrying out the Siegel-Tukey test are as follows:
1. Arrange the combined data from smallest to largest.

2. Assign rank 1 to the smallest observation, rank 2 to the largest observation, rank
3 to the next largest observation, rank 4 to the next smallest observation, rank 5
to the next smallest observation, and so on.

3. Apply the Wilcoxon rank-sum test based on the rank sum of X.


In this test, we place lower ranks on the more extreme observations and higher ranks on
the middle ones. A smaller rank sum indicates that X has more extreme observations
than Y and hence X tends to have a larger variability than Y .
Remark. If X and Y do not have the same location parameter (median μ), we may apply the Siegel-Tukey test to $X - \text{med}_X$ and $Y - \text{med}_Y$, where $\text{med}_X$ and $\text{med}_Y$ are the sample medians of X and Y, respectively.
Example 5.12. Consider the following two sets of data and calculate the Siegel-Tukey test statistic:

X: 1.4 7.8 6.3 8.9
Y: 3.9 2.5

Solution. We have m = 4, n = 2, N = 6, and $\binom{N}{m} = \binom{6}{4} = 15$.

Combined data:        1.4  2.5  3.9  6.3  7.8  8.9
Siegel-Tukey ranks:     1    4    5    6    3    2

Hence, $W = 1 + 2 + 3 + 6 = 12$ and $W_2 = 4 + 5 = 9$. We may apply the permutation method to construct the null distribution of W:
Ranks of X    Ranks of Y    W       Ranks of X    Ranks of Y    W
3, 4, 5, 6    1, 2          18      1, 3, 4, 5    2, 6          13
2, 4, 5, 6    1, 3          17      1, 2, 5, 6    3, 4          14
2, 3, 5, 6    1, 4          16      1, 2, 4, 6    3, 5          13
2, 3, 4, 6    1, 5          15      1, 2, 4, 5    3, 6          12
2, 3, 4, 5    1, 6          14      1, 2, 3, 6    4, 5          12
1, 4, 5, 6    2, 3          16      1, 2, 3, 5    4, 6          11
1, 3, 5, 6    2, 4          15      1, 2, 3, 4    5, 6          10
1, 3, 4, 6    2, 5          14

Under H0, each of these outcomes occurs with probability $\frac{1}{15}$. For $H_1: \sigma_1 > \sigma_2$, the p-value is $\Pr(W \le 12) = \frac{4}{15} = 0.267$.
Remark. One of the difficulties of the Siegel-Tukey test is that if the ranking starts the alternating pattern with the largest observation instead of the smallest, the value of the Wilcoxon statistic will be different. The Ansari-Bradley test (Ansari and Bradley, 1960) helps overcome this problem. However, the corresponding rank-sum statistic no longer follows the same distribution as the Wilcoxon rank-sum statistic. The critical values of the test can be found from a table in Ansari and Bradley (1960) or using the permutation method.
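The alternating Siegel-Tukey ranking is easy to compute directly. The helper st_ranks below is a sketch, not a base R function (for the Ansari-Bradley variant, base R does provide ansari.test):

    # Siegel-Tukey ranks: rank 1 to the smallest value, ranks 2 and 3 to the
    # two largest, ranks 4 and 5 to the next two smallest, and so on.
    st_ranks <- function(z) {
      n <- length(z)
      idx <- order(z)                  # indices of z, smallest to largest
      pos <- integer(n)                # pos[r] = sorted position given rank r
      lo <- 1; hi <- n; r <- 1; step <- 1; from_low <- TRUE
      while (r <= n) {
        take <- min(step, n - r + 1)
        for (k in seq_len(take)) {
          if (from_low) { pos[r] <- lo; lo <- lo + 1 }
          else          { pos[r] <- hi; hi <- hi - 1 }
          r <- r + 1
        }
        from_low <- !from_low
        step <- 2
      }
      ranks <- numeric(n)
      ranks[idx[pos]] <- seq_len(n)
      ranks
    }

    z <- c(1.4, 7.8, 6.3, 8.9, 3.9, 2.5)   # X first (4 values), then Y
    r <- st_ranks(z)                        # 1 3 6 2 5 4
    sum(r[1:4])                             # W = 12, as in Example 5.12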

Chapter Notes
1. The power efficiency of the sign test relative to the Student's t test for the case of the normal distribution is 95% for small samples. The relative efficiency appears to decrease with increasing sample size, approaching about 75%. The relative efficiency decreases as well for increasing significance level and for increasing alternatives. See Daniel (1990, p. 33).
2. The asymptotic efficiency of the Wilcoxon signed rank test has been investigated by Noether (see Daniel (1990), p. 43). The Wilcoxon signed rank test has an asymptotic relative efficiency of 0.955 relative to the one-sample t test if the differences are normally distributed, and an efficiency of 1 if the differences are uniformly distributed. The asymptotic efficiency of the sign test relative to the Wilcoxon signed rank test is 2/3 if the differences are normally distributed and 4/3 if the differences follow the double exponential distribution.

3. Tables for the Mann-Whitney test are found in Daniel (1990), p. 508.

4. The consistency of the Mann-Whitney test is discussed in Gibbons and Chakraborti (2011), p. 141. In fact, $U/mn \to q_1$ as $\min(m, n) \to \infty$, since $Var(U/mn) \to 0$, thereby showing that U/mn is a consistent estimator of $q_1$. Moreover, the test is consistent for the three subclasses of alternatives $q_1 < \frac{1}{2}$, $q_1 > \frac{1}{2}$, $q_1 \ne \frac{1}{2}$, since the power of the test when that alternative is true approaches 1 as the sample size tends to infinity. See pp. 17 and 142 in Gibbons and Chakraborti (2011) and pp. 267–268 in Fraser (1957). By choosing $h(x_1, x_2, y_1, y_2) = I(x_1 < y_1, x_2 < y_1) + I(y_1 < x_1, y_2 < x_1)$, the test is consistent against all alternatives.

5.4. Exercises
Exercise 5.1. Refer to the data in Example 5.1. Last year, the average weekly sales
was 100 units. Is there sufficient evidence to conclude that sales this year exceeds last
year’s sales? Test at α = 0.05. The usual type of test to run for such a question is
a one-sample t-test. For the given problem, the correct hypotheses are H0 : μ = 100
versus H1 : μ > 100 where μ is the true mean weekly sales of new mobile phones at the
mobile phone shop this year.

(a) What is the result of the t-test? (Ans: t = 0.4048, d.f. = 11)

(b) What assumptions have you made in conducting the t-test?

(c) Are these assumptions valid? You may draw the histogram of the sales.

(d) Under what conditions do we feel safe to use the above t-test?

Exercise 5.2. Fifteen patients are trying a new diet. The differences between weights
before and after the diet are (in the order from the smallest)

-7.8 -6.9 -4.0 3.7 6.5 8.7 9.1 10.1 10.8 13.6 14.4 15.6 20.2 22.4 23.5

(a) If the diet has no effect, what should the median weight loss be?
(b) Is it a one- or two-sided sign test? Hence, conduct the test at the 5% significance
level.

(c) What assumptions do we need for the sign test in (b)?

(d) Find a 97.9% confidence interval for the population median.

Exercise 5.3. Denote F(x) as the cumulative distribution function (cdf) of a continuous random variable X. Show that for a given x, an approximate 100(1 − α)% confidence interval for F(x) is
$$\hat{F}(x) \pm z_{\alpha/2} \sqrt{\frac{\hat{F}(x)[1 - \hat{F}(x)]}{n}},$$
where $\hat{F}(x)$ is the empirical cdf.

Exercise 5.4. Show that under the null hypothesis, the Wilcoxon signed rank statistic is equivalent in distribution to the sum
$$W^+ = \sum_{j} V_j,$$
where $V_j$ takes the values 0 and j with probability 1/2 each.

Exercise 5.5.

(a) Two different groups of individuals were compared with respect to how quickly they
respond to a changing traffic light. Test using the two-sided Wilcoxon statistic the
hypothesis that there is no difference between the two groups on the basis of the
following data
Group 1 19.0 14.4 18.2 15.6 14.5 11.2 13.9 11.6
Group 2 12.1 19.1 11.6 21.0 16.7 10.1 18.3 20.5

(b) Do the data suggest that the population variances differ? Carry out a Siegel-Tukey
test at the 5% significance level.

(c) Suppose that those same groups are compared at a later time. Indicate how you would combine the two data sets using the smooth embedding approach in order to define a single test. (Hint: suppose that in the embedding approach the parameters for the individual groups are $\theta_1$, $\theta_2$, respectively. Define $\gamma_1 = \frac{\theta_1 + \theta_2}{2}$, $\gamma_2 = \frac{\theta_1 - \theta_2}{2}$. Then
$$\theta_1 = \gamma_1 + \gamma_2, \qquad \theta_2 = \gamma_1 - \gamma_2,$$
and we may phrase the testing problem in terms of $\gamma_1$, $\gamma_2$.)
6. Multi-Sample Problems
In this chapter, we present a unified theory of hypothesis testing based on ranks. The theory consists of defining two sets of ranks, one consistent with the alternative and the other consistent with the data itself, in a notion to be described. The test statistic is then constructed by measuring the distance between the two sets. Critchlow (1986, 1992) utilized a different definition for measuring the distance between sets. The problem can be embedded into a smooth parametric alternative framework which then leads to a test statistic. It is seen that the locally most powerful tests can be obtained from this construction. We illustrate the approach in the cases of testing for ordered as well as unordered multi-sample location problems. In addition, we also consider dispersion alternatives. The tests are derived in the case of the Spearman and the Hamming distance functions. The latter were chosen to exemplify that different approaches may be needed to obtain the asymptotic distributions under the hypotheses.

6.1. A Unified Theory of Hypothesis Testing


In this section, we present a general approach for testing hypotheses. We begin by defining two sets of rankings, one set most in agreement with the data and the other most in agreement with the alternative. Next, we define a distance function to measure the distance between a ranking from one set and a ranking from the other. Finally, we compute the test statistic, defined to be the average distance over all such pairs of rankings, one chosen from each set. The test then rejects the null hypothesis for small values of this average distance. After describing the specifics of this approach, we apply this general methodology to various testing problems. Our notation in this chapter differs from the notation of previous chapters. It will be convenient to denote observed ranks by μ since we will be constructing a set of permutations generated by it.

6.1.1. A General Approach


The following unified approach was described in Alvo and Pan (1997). Let H0, H1 be the null and alternative hypotheses, respectively, in a typical testing situation. Let
Pn = {η : (η(1), . . . , η(n))} be the set of all permutations of the integers 1, . . . , n and
let d (μ, ν) be a measure of distance between permutations μ, ν in Pn .

• Step 1: Let $X_1, \ldots, X_n$ be a collection of random variables from some continuous distribution and let μ(i) be the rank of $X_i$ among the X's. The continuity assumption ensures that, with probability one, there are no ties, so that μ = (μ(1), . . . , μ(n)) is a permutation.

• Step 2: Define {μ} to be the set of all rankings which are equivalent to the observed ranking μ, in the sense that ranks occupied by identically distributed random variables are exchangeable.

• Step 3: Define E to be the class of extremal rankings which are most in agreement with the alternative. The set E is not data dependent.

• Step 4: Define the distance between the subclass {μ} and E as
$$d(\{\mu\}, E) = \sum_{\mu \in \{\mu\}} \sum_{\nu \in E} d(\mu, \nu).$$
Small values of d({μ}, E) are consistent with the alternative and lead to the rejection of the null hypothesis.

We may now integrate the unified theory in the context of the smooth alternative model. For each fixed μ ∈ {μ}, we may define the smooth alternative density as proportional to
$$\pi(\mu; E, \theta) \sim \exp\left[-\theta\, d(\mu, E) - K(\theta)\right], \qquad (6.1)$$
where
$$d(\mu, E) = \sum_{\nu \in E} d(\mu, \nu)$$
and K(θ) is a normalizing constant. Let $n_\mu$ be the cardinality of {μ}. Under the null hypothesis, $H_0: \theta = 0$, all the rankings μ ∈ {μ} are equally likely. Under the alternative, $H_1: \theta > 0$, rankings μ which are closer to E are more likely than those which are further away. The distance model in (6.1) generalizes the distance-based models of Diaconis and Mallows described in Alvo and Yu (2014), as discussed earlier in Section 4.4.
The logarithm of the composite likelihood function constructed from (6.1) is then proportional to
$$\sum_{\mu_i \in \{\mu\}} \left[-\theta\, d(\mu_i, E) - K(\theta)\right] = -\theta \sum_{\mu_i \in \{\mu\}} d(\mu_i, E) - n_\mu K(\theta),$$
and the score statistic obtained by calculating the derivative at θ = 0 becomes
$$U(\{\mu\}) = -\left(\sum_{\mu_i \in \{\mu\}} d(\mu_i, E) + n_\mu K'(0)\right).$$
The corresponding test statistic is given by
$$\left[U(\{\mu\})\right]' \Sigma^{-1} \left[U(\{\mu\})\right], \qquad (6.2)$$
where Σ is the variance-covariance matrix of U({μ}). In the next two sections, we consider direct applications of the unified theory. Specifically, we describe the multi-sample problem of location in both the ordered and unordered cases. As well, we describe a test for umbrella alternatives and conclude the chapter with a discussion of tests for dispersion.

6.1.2. The Multi-Sample Problem in the Ordered Case


We may consider the general multi-sample location problem with ordered alternatives.
Let F1 (x) , . . . , Fr (x) be r continuous distributions and suppose we wish to test

H0 : F1 (x) = . . . = Fr (x) , for all x

against the alternative


$$H_1: F_r(x) \le \ldots \le F_1(x)$$
with strict inequality for some x. Let $X_{N_{k-1}+1}, \ldots, X_{N_k}$ be a random sample of size $n_k$ from $F_k(x)$, where $N_0 = 0$ and
$$N_k = n_1 + \ldots + n_k, \quad k = 1, \ldots, r.$$
Rank all the $N_r$ observations among themselves and write the permutation of observed ranks
$$\mu = \left[\mu(1), \ldots, \mu(N_1) \mid \ldots \mid \mu(N_{r-1}+1), \ldots, \mu(N_r)\right],$$
where ranks from the same distribution are placed together. Hence, μ(1), . . . , μ(N1) represent the observed ranks of the n1 observations from F1(x) in the combined sample.
The equivalent subclass {μ} consists of all permutations of the integers 1, . . . , Nr which
assign the same set of ranks to the individual populations as μ does. On the other hand,
the extremal set E consists of all permutations which assign ranks Nk−1 + 1, . . . , Nk to
population k. Alvo and Pan (1997) derived the test statistics corresponding to the
Spearman, the Spearman Footrule, and the Hamming distances. They also obtained the
asymptotic distributions under both the null and alternative hypotheses.
In order to illustrate the methodology, we consider the two-sample case. Suppose
that we observe independent random variables X1 , X2 , X3, X4 with X1 , X2 from F1 and
X3, X4 from F2 . The alternative hypothesis claims that F2 ≤ F1 . Among all rankings of
the integers (1, 2, 3, 4) one would expect that the small ranks would be most consistent
with F1 and larger ranks would be most consistent with F2. Adopting the convention
that the first two components of a ranking refer to population 1 and the next two
to population 2, the exchangeable rankings compatible with the alternative hypothesis
would be
(1, 2|3, 4) , (2, 1|3, 4) , (1, 2|4, 3) , (2, 1|4, 3) .
Returning now to the general situation, the Spearman and Hamming test statistics
are as follows.

Spearman: In the case of the Spearman distance, the test statistic in the multi-sample case was shown to be
$$S = \sum_{i=1}^{N_r} c(i)\, \frac{\mu(i)}{N_r + 1},$$
where, for 1 ≤ k ≤ r,
$$c(i) = N_{k-1} + N_k, \quad \text{for all } i \text{ such that } N_{k-1} < i \le N_k.$$

We note that
$$\bar{c} = \frac{\sum_{k=1}^{r} \sum_{i=N_{k-1}+1}^{N_k} (N_{k-1} + N_k)}{N_r} = \frac{\sum_{k=1}^{r} n_k (N_{k-1} + N_k)}{N_r} \qquad (6.3)$$
$$= \frac{\sum_{k=1}^{r} (N_k - N_{k-1})(N_{k-1} + N_k)}{N_r} = \frac{\sum_{k=1}^{r} \left(N_k^2 - N_{k-1}^2\right)}{N_r} = N_r. \qquad (6.4)$$

We recognize that S is a simple linear rank statistic and consequently, under the null hypothesis, S is asymptotically normal with mean
$$\frac{N_r \bar{c}}{2}$$
and variance
$$\sigma^2 = \frac{N_r^3}{12} \sum_{k=1}^{r} w_k W_k W_{k-1},$$
where, as $\min\{n_1, \ldots, n_r\} \to \infty$,
$$\frac{n_k}{N_r} \to w_k, \qquad W_k \equiv \sum_{i=1}^{k} w_i.$$
The test rejects for large values of S. In the two-sample case, the statistic becomes:

Example 6.1. The statistic in the two-sample ordered location problem based on the Spearman distance is
$$S = \frac{n_1}{n+1} \sum_{i=1}^{n_1} \mu(i) + \frac{n_1 + n}{n+1} \sum_{i=n_1+1}^{n_1+n_2} \mu(i) = \frac{n_1 n}{2} + \frac{n}{n+1} W,$$
where W is the Wilcoxon statistic, for which the exact mean and variance from Section 5.3.3 are
$$E[W] = \frac{n_2(n+1)}{2}, \qquad Var[W] = \frac{n_1 n_2 (n+1)}{12}.$$

Hamming: In the case of the Hamming distance, the test statistic in the multi-sample case was shown to be
$$H = \sum_{i=1}^{N_r} a_{i\mu(i)},$$
where
$$a_{ij} = \begin{cases} \frac{1}{n_k} & i, j \in \{N_{k-1}+1, \ldots, N_k\}, \ 1 \le k \le r \\ 0 & \text{otherwise.} \end{cases}$$
Equivalently, we may express Hamming's statistic as
$$H = \frac{Y_1}{n_1} + \ldots + \frac{Y_r}{n_r},$$
where $Y_k$ is the number of observations from the $k$th sample whose ranks fall in the set $\{N_{k-1}+1, \ldots, N_k\}$. The test rejects for large values of H. We may apply Hoeffding's theorem (Section 3.3) to obtain a central limit theorem for the Hamming statistic under the null hypothesis. We have that H is asymptotically normal with mean
$$\frac{1}{N_r} \sum_{i=1}^{N_r} \sum_{j=1}^{N_r} a_{ij} = 1$$

and variance
$$\frac{1}{N_r - 1} \sum_{i=1}^{N_r} \sum_{j=1}^{N_r} d_{ij}^2 = \frac{r-1}{N_r - 1}, \qquad (6.5)$$
where
$$d_{ij} = a_{ij} - \bar{a}_{i.} - \bar{a}_{.j} + \bar{a}_{..} = a_{ij} - \frac{1}{N_r} - \frac{1}{N_r} + \frac{1}{N_r} = a_{ij} - \frac{1}{N_r}.$$
The result follows from the fact that
$$\frac{\max_{1 \le i, j \le N_r} d_{ij}^2}{\frac{1}{N_r} \sum_{i=1}^{N_r} \sum_{j=1}^{N_r} d_{ij}^2} \le \frac{\max_{1 \le k \le r} n_k^{-2}}{\frac{r-1}{N_r}} \to 0.$$

Example 6.2. The statistic in the two-sample ordered location problem based on Hamming's distance becomes
$$H = \sum_{i=1}^{n_1+n_2} a_{i\mu(i)},$$
where
$$a_{ij} = \begin{cases} \frac{1}{n_1} & i, j \in \{1, \ldots, n_1\} \\ \frac{1}{n_2} & i, j \in \{n_1+1, \ldots, n_1+n_2\} \\ 0 & \text{otherwise.} \end{cases}$$
It follows from Hoeffding's combinatorial central limit theorem (see Theorem 3.6) that H is asymptotically normal with mean equal to 1 and variance
$$\frac{1}{n_1 + n_2 - 1}$$
as $\min(n_1, n_2) \to \infty$.

6.1.3. The Multi-Sample Problem in the Unordered Case


The solution to the multi-sample problem in the unordered case may be derived by considering the set of all possible orderings of the alternative. In particular, consider the following specific ordered testing problem
$$H_0: F_1(x) = \ldots = F_r(x), \quad \text{for all } x$$
against
$$H_{1h}: F_{e_{hr}}(x) \le \ldots \le F_{e_{h1}}(x),$$
where h is fixed, $1 \le h \le r!$, and $[e_{h1}, \ldots, e_{hr}]$ is a permutation. We may now phrase the unordered testing problem as that of testing
$$H_0: F_1(x) = \ldots = F_r(x), \quad \text{for all } x$$
against
$$H_1: \bigcup_{h=1}^{r!} H_{1h}.$$
Let $T_h$ be a test statistic for testing H0 against $H_{1h}$ with rejection region $\{T_h < c\}$. We define the test statistic $T_M$ for testing H0 against H1 using the following three steps:

1. Linearization: Let $\tilde{\alpha} = (\alpha_1, \ldots, \alpha_{r!})$ and put
$$T_L(\tilde{\alpha}) = \sum_{h=1}^{r!} \alpha_h T_h.$$

2. Normalization: Put
$$T_N(\tilde{\alpha}) = \frac{T_L(\tilde{\alpha}) - E_0[T_L(\tilde{\alpha})]}{\sqrt{Var_0\, T_L(\tilde{\alpha})}}.$$

3. Minimization: Put
$$T_M = \min_{\tilde{\alpha}} T_N(\tilde{\alpha}).$$

The use of this approach in the case of the Spearman distance leads to the well-known Kruskal-Wallis test statistic for the one-way analysis of variance. Details of the proofs of the next two theorems may be found in Alvo and Pan (1997).

Theorem 6.1 (Spearman Case). The test statistic in the unordered case based on the Spearman distance rejects the null hypothesis for large values of $T_M = \bar{\mu}' \Sigma_S^{-1} \bar{\mu}$, where
$$\bar{\mu} = (\bar{\mu}_1 - E_0[\bar{\mu}_1], \ldots, \bar{\mu}_{r-1} - E_0[\bar{\mu}_{r-1}])'$$
with
$$\bar{\mu}_k = \sum_{i=N_{k-1}+1}^{N_k} \mu(i)$$
and expectation under the null
$$E_0[\bar{\mu}_k] = n_k\left(\frac{N_r + 1}{2}\right).$$
Also, the (r − 1) × (r − 1) covariance matrix $\Sigma_S$ has components
$$Cov(\bar{\mu}_k, \bar{\mu}_{k'}) = \begin{cases} n_k(N_r - n_k)(N_r + 1)/12 & k = k' \\ -n_k n_{k'}(N_r + 1)/12 & k \ne k' \end{cases}$$
and inverse
$$\Sigma_S^{-1} = \frac{12}{N_r(N_r + 1)}\left[\text{diag}\left(\frac{1}{n_1}, \ldots, \frac{1}{n_{r-1}}\right) + J/n_r\right].$$
Moreover, under the null hypothesis, as $\min\{n_1, \ldots, n_r\} \to \infty$,
$$T_M \xrightarrow{L} \chi^2_{r-1}.$$
The statistic $T_M$ coincides with the well-known Kruskal-Wallis test statistic
$$\frac{12}{N_r(N_r + 1)} \sum_{k=1}^{r} n_k \left(\frac{\bar{\mu}_k}{n_k} - \frac{N_r + 1}{2}\right)^2.$$

Proof. The proof is a direct consequence of the multivariate central limit theorem.

Theorem 6.2 (Hamming Case). The test statistic in the unordered case based on the Hamming distance rejects the null hypothesis for large values of $T_M = \bar{\mu}' \Sigma_H^{-1} \bar{\mu}$, where
$$\bar{\mu} = (\bar{\mu}_{11} - E_0[\bar{\mu}_{11}], \ldots, \bar{\mu}_{r-1,r-1} - E_0[\bar{\mu}_{r-1,r-1}])'$$
with
$$\bar{\mu}_{kp} = \sum_{j=N_{k-1}+1}^{N_k} a_{p\mu(j)}$$
and
$$a_{pj} = \begin{cases} 1 & j \in \{N_{p-1}+1, \ldots, N_p\} \\ 0 & \text{otherwise,} \end{cases}$$
expectation
$$E_0[\bar{\mu}_{kp}] = \frac{n_k n_p}{N_r},$$
and covariance
$$\Sigma_H = Cov(\bar{\mu}_{kp}, \bar{\mu}_{k'p'}) = \begin{cases} \dfrac{n_k n_p (N_r - n_k)(N_r - n_p)}{N_r^2(N_r - 1)} & k = k', \ p = p' \\[1ex] -\dfrac{n_k n_p (N_r - n_k) n_{p'}}{N_r^2(N_r - 1)} & k = k', \ p \ne p' \\[1ex] -\dfrac{n_k n_{k'} n_p (N_r - n_p)}{N_r^2(N_r - 1)} & k \ne k', \ p = p' \\[1ex] \dfrac{n_k n_{k'} n_p n_{p'}}{N_r^2(N_r - 1)} & k \ne k', \ p \ne p'. \end{cases}$$
Moreover, under the null hypothesis, as $\min\{n_1, \ldots, n_r\} \to \infty$,
$$T_M \xrightarrow{L} \chi^2_{(r-1)^2}.$$
Equivalently, $\bar{\mu}_{kp}$ may be interpreted as the number of the ranks $\{\mu(N_{k-1}+1), \ldots, \mu(N_k)\}$ falling in the set $\{N_{p-1}+1, \ldots, N_p\}$.

Proof. The proof is a direct consequence of the multivariate central limit theorem.

6.1.4. Tests for Umbrella Alternatives


A second application of the unified theory is to the case of testing for trend when there is the possibility of an umbrella-type alternative. As an example, consider the data on intelligence scores in Appendix A. Wechsler Adult Intelligence Scale scores were recorded
on 12 males listed by age groups. A look at the data reveals that the peak is located in
the 35 − 54 age group. In general, we would like to test the null hypothesis that there is
no difference due to age against the alternative that the scores rise monotonically prior
to the peak and decrease thereafter. Two situations arise: when the location of the peak
is known and when it is unknown.
Let F1 (x) , . . . , Fr (x) be r continuous distributions and suppose we wish to test

H0 : F1 (x) = . . . = Fr (x) , for all x

against the umbrella alternative

H1 : F1 (x) ≥ . . . ≥ Fp (x) ≤ . . . ≤ Fr (x)

with strict inequality for some x. Suppose that there are mi observations from Fi , i =
1, . . . , r and that n = m1 + . . . + mr .
In the case when the location of the peak is known, Alvo (2008) obtained the test statistics corresponding to both the Spearman and Kendall distance functions and showed that they are asymptotically equivalent under the condition that
$$\min_i(m_i) \to \infty$$
in such a way that $\frac{m_i}{n} \to \lambda_i > 0$, $n \to \infty$. Moreover, he derived the asymptotic distribution of the test statistics under both the null and alternative hypotheses. For the Spearman distance in the special case $m_i = m$, it was shown that when the location of the peak is known, the test statistic is
$$S_p = mn\left[\sum_{i \le p} \frac{i}{p}\left(\bar{\mu}_i - \frac{n+1}{2}\right) + \sum_{i > p} \frac{r+1-i}{r+1-p}\left(\bar{\mu}_i - \frac{n+1}{2}\right)\right],$$
where $\bar{\mu}_i$ represents the average of the ranks in the $i$th population. Under the null hypothesis, $S_p$ is asymptotically normal with mean 0 and variance
$$\sigma_p^2 = \frac{m[n(n+1)]^2}{12}\left[\sum_{i=1}^{p}\left(\frac{i}{p} - \frac{r+1}{2r}\right)^2 + \sum_{i=p+1}^{r}\left(\frac{r+1-i}{r+1-p} - \frac{r+1}{2r}\right)^2\right].$$

Alvo (2008) also developed a test for the case when the location of the peak is unknown. This is based on the statistic
$$S_{max} = \max_p \left(\frac{S_p}{\sigma_p}\right)$$
whose asymptotic distribution is that of $Bz$, where z has a standard multivariate normal distribution and the matrix B derives from the relation
$$Cov\left(\frac{S_1}{\sigma_1}, \ldots, \frac{S_r}{\sigma_r}\right) = BB'.$$
The distribution of Bz is generally obtained by simulation.

6.2. Test for Dispersion


In this section, we shall be concerned with detecting differences in variability between two populations. We shall assume that the two populations of interest have the same median and are described by cumulative distribution functions $F_X(x)$ and
$$F_Y(x) = F_X(\mu + (x - \mu)/\gamma),$$
for some dispersion parameter γ > 0. We would like to test
$$H_0: \gamma \ge 1 \quad \text{against} \quad H_1: \gamma < 1.$$
The alternative states that the second population is less spread out than the first. For simplicity of presentation, we shall assume that both sample sizes are even numbers, with
$$n_i = 2m_i, \quad i = 1, 2.$$
Applying the unified theory to the dispersion problem, rank all the observations together and let items 1, . . . , n1 be from the first population and items n1 + 1, . . . , n1 + n2 be from the second. The equivalence class {μ} consists of all permutations obtained by permuting the labels assigned to the items in each population respectively. Moreover, in view of the assumption that the medians are the same, we also transpose the items ranked in positions i and n + 1 − i. The extremal class E consists of all permutations which rank the items from the second population in the middle and the items from the first population at the two ends. This is because the first population is more diverse.
Using the Spearman distance, it can be shown that the test statistic takes the form
$$S = \sum_{i=1}^{n_1} \left|R_i - \frac{n+1}{2}\right|,$$
which is precisely the Freund-Ansari-Bradley test (Hájek and Sidak (1967), p. 95).
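A minimal sketch of this statistic in R (fab_stat is a hypothetical helper, not a library function):

    # Sum of absolute deviations of the first sample's combined-sample
    # ranks from the central rank (n + 1) / 2.
    fab_stat <- function(x, y) {
      n <- length(x) + length(y)
      r <- rank(c(x, y))[seq_along(x)]   # ranks of the x's in the pooled data
      sum(abs(r - (n + 1) / 2))
    }

Large values of the statistic indicate that the first sample occupies the extreme ranks, i.e., that it is the more dispersed of the two samples.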
6.3. Tests of the Equality of Several Independent Samples

Alvo (2016) developed an alternative approach to test for the equality of r distributions which exploits the smooth model directly. Let $\{X_{ij}\}$, i = 1, . . . , r; j = 1, . . . , ni, be independent random variables such that, for fixed i, the $X_{ij}$ have a common cdf $F_i$. We wish to test the hypotheses
$$H_0: F_i = F, \quad i = 1, \ldots, r$$
against
$$H_1: F_i \ne F_j, \quad \text{for some } i \ne j.$$
We are concerned with differences in location. Consider first a single observation from each population. Select the kernel to be the sign function, denoted sgn, which compares a pair of observations, one from each population, and define for the $i$th population the smooth alternative density as proportional to
$$\pi(x_1, \ldots, x_r; \theta_i) \sim \exp\left[\theta_i \sum_{l \ne i} \text{sgn}(x_i - x_l) - K(\theta_i)\right].$$
Suppose now that we observe a random sample from each of the r populations: $\{X_{ij}, j = 1, \ldots, n_i\}$, i = 1, . . . , r. The composite log likelihood function for the $i$th population is proportional to
$$\theta_i \sum_{l \ne i} \sum_{j=1}^{n_i} \sum_{j'=1}^{n_l} \text{sgn}(x_{ij} - x_{lj'}) - \sum_{l \ne i} n_i n_l K(\theta_i)$$

and hence the composite log likelihood function taking into account all the populations is proportional to
$$l(\theta) \sim \sum_{i=1}^{r} \theta_i \sum_{l \ne i} \sum_{j=1}^{n_i} \sum_{j'=1}^{n_l} \text{sgn}(x_{ij} - x_{lj'}) - \sum_{i=1}^{r} \sum_{l \ne i} n_i n_l K(\theta_i).$$

The null hypothesis may now be defined as
$$H_0: \theta_i = \theta, \quad i = 1, \ldots, r \qquad (6.6)$$
against
$$H_1: \theta_i \ne \theta_j, \quad \text{for some } i \ne j. \qquad (6.7)$$
Under the null hypothesis,
$$\sum_{i=1}^{r} \sum_{l \ne i} \theta\, \text{sgn}(x_i - x_l) = \theta \sum_{i=1}^{r} \sum_{l \ne i} \text{sgn}(x_i - x_l) = 0,$$
which shows that the model can actually be specified by using only (r − 1) parameters. Consequently, we may redefine the parameters so that under the null hypothesis $\theta_i = 0$, and we wish to test $H_0: \theta_i = 0$, i = 1, . . . , r. It follows that K(0) = 0.
Suppose now that we observe a random sample from each of the r populations: $X_{ij}$, j = 1, . . . , ni, i = 1, . . . , r. The composite log likelihood function is proportional to
$$l(\theta) \sim \sum_{i=1}^{r} \theta_i \sum_{l \ne i} \sum_{j=1}^{n_i} \sum_{j'=1}^{n_l} \text{sgn}(x_{ij} - x_{lj'}) - \sum_{i=1}^{r} \sum_{l \ne i} n_i n_l K(\theta).$$

Since
$$R_{ij} - \frac{n+1}{2} = \frac{1}{2} \sum_{l \ne i} \sum_{j'=1}^{n_l} \text{sgn}(x_{ij} - x_{lj'}),$$
we have
$$l(\theta) \sim 2 \sum_{i=1}^{r} \sum_{j=1}^{n_i} \theta_i \left(R_{ij} - \frac{n+1}{2}\right) - \sum_{i=1}^{r} \sum_{l \ne i} n_i n_l K(\theta) = 2 \sum_{i=1}^{r} n_i \theta_i \left(\bar{R}_i - \frac{n+1}{2}\right) - \sum_{i=1}^{r} \sum_{l \ne i} n_i n_l K(\theta),$$
where $\bar{R}_i$ is the average of the overall ranks assigned to the $i$th population. The score function is given by the r × 1 vector
$$U = \left(n_1\left(\bar{R}_1 - \frac{n+1}{2}\right), \ldots, n_r\left(\bar{R}_r - \frac{n+1}{2}\right)\right)'.$$

The g-inverse of the corresponding covariance matrix is given by the diagonal matrix
$$I^{-1} = \text{diag}\left(\frac{12}{n(n+1)n_k}\right)$$
(Kruskal, 1952), and it follows that the Rao score test statistic, denoted KW, is
$$KW = U' I^{-1} U = \frac{12}{n(n+1)} \sum_{i=1}^{r} n_i \left(\bar{R}_i - \frac{n+1}{2}\right)^2, \qquad (6.8)$$
which is the usual Kruskal-Wallis statistic. Under the null hypothesis,
$$KW \xrightarrow{L} \chi^2_{r-1}.$$
We reject the null hypothesis for large values of KW. The Kruskal-Wallis test is thus locally most powerful for testing (6.6).
When there are ties in the data, we may adjust the ranks by using the mid-ranks for the tied data, as we did for the Wilcoxon test. The permutation method can also be applied to the KW statistic. However, in order to maintain the chi-square approximation, the KW statistic should be modified. Recall that under the one-way ANOVA model, $SSB/\sigma^2 \sim \chi^2(r-1)$ under H0, and the mean of the ranks is still $\frac{n+1}{2}$ whether or not there are ties. It is thus natural to assume that for some constant C,
$$KW_{ties} \equiv C \sum_{i=1}^{r} n_i \left(\bar{R}_i - \frac{n+1}{2}\right)^2,$$
where, to fulfill the chi-square approximation, we must have
$$E(KW_{ties}) = r - 1.$$
According to the properties of sampling from finite populations, we have
$$E\left(\bar{R}_i - \frac{n+1}{2}\right)^2 = Var(\bar{R}_i) = \frac{(n - n_i)\sigma^2}{(n-1)n_i},$$
where
$$\sigma^2 = \frac{n^2 - 1}{12} - \frac{\sum_{i=1}^{g}(t_i^3 - t_i)}{12n}$$
is the population variance of the combined ranks or adjusted ranks and g is the number of tied groups. Hence, we have
$$E(KW_{ties}) = E\left[C \sum_{i=1}^{r} n_i \left(\bar{R}_i - \frac{n+1}{2}\right)^2\right] = C(r-1)\frac{n\sigma^2}{n-1},$$
so that
$$C = \frac{n-1}{n\sigma^2}.$$
Thus, a more general form of the KW statistic in the case of ties is given by
$$KW_{ties} = \frac{KW}{1 - \dfrac{\sum_{i=1}^{g}(t_i^3 - t_i)}{n^3 - n}}.$$
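A one-line sketch of this correction in R (kw_ties is a hypothetical helper):

    # Tie-corrected Kruskal-Wallis statistic: kw is the uncorrected value,
    # tie.sizes the vector (t_1, ..., t_g) of tied-group sizes, n the total.
    kw_ties <- function(kw, tie.sizes, n) {
      kw / (1 - sum(tie.sizes^3 - tie.sizes) / (n^3 - n))
    }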

Remark. Note that this is only an intuitive derivation of the KW statistic in the case of ties. For a more formal proof, see Kruskal and Wallis (1952). The Kruskal-Wallis test can be implemented using either of the R functions below:

• If no ties, use kruskal.test function in the base R installation.

• If ties exist, use kruskal_test function in the R add-on package coin.

The parametric paradigm allows for a further investigation into possible differences among the populations. We may conduct a bootstrap study by sampling with replacement $n_i$ observations from the $i$th population and computing the overall average ranking $\bar{R}_i$. We note that the maximum composite likelihood equation for each $\theta_i$ is
$$\bar{R}_i - \frac{n+1}{2} = \frac{\partial K(\theta_i)}{\partial \theta_i} = \frac{1}{n_i} \sum_{l \ne i} \left(P_\theta(X_i > X_l) - \frac{1}{2}\right). \qquad (6.9)$$
Hence, after each bootstrap iteration, we may obtain an estimate of the left-hand side of (6.9). The histogram of bootstrapped values under the null hypothesis should be centered around 0. An example illustrates this computation.

Example 6.3. Eighteen lobsters of the same size in a species are divided randomly into
three groups and each group is prepared by a different cook using the same recipe. Each
prepared lobster is then rated on each of the criteria of aroma, sweetness, and brininess
by professional taste testers. The following shows the combined scores for the lobsters
prepared by the three cooks. A higher score represents a better taste of the lobster.

Cook A Cook B Cook C


7.03 8.63 7.75
9.97 6.51 4.37
8.25 5.93 5.07
7.99 9.01 6.31
8.35 8.59 4.59
6.79 7.65 6.63

Based on the data, apply the Kruskal-Wallis test to test the null hypothesis that
median scores for all three cooks are the same at the 5% level of significance.

Solution. Using the function kruskal.test in R, we obtain KW = 6.9825, df = 2, and p-value = 0.03046 < 0.05. There is enough evidence to conclude that the median scores for the three cooks are not all the same.
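A sketch of the computation:

    scores <- c(7.03, 9.97, 8.25, 7.99, 8.35, 6.79,   # Cook A
                8.63, 6.51, 5.93, 9.01, 8.59, 7.65,   # Cook B
                7.75, 4.37, 5.07, 6.31, 4.59, 6.63)   # Cook C
    cook <- factor(rep(c("A", "B", "C"), each = 6))
    kruskal.test(scores, cook)
    # Kruskal-Wallis chi-squared = 6.9825, df = 2, p-value = 0.03046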
6.4. Tests for Interaction


In this section, we consider the general two-factor design with equal numbers of replications in each cell. Such designs are utilized in statistics to test for main effects and for interactions in a variety of experiments. In more recent times, they have been applied in a genetics environment in order to understand the underlying biological mechanisms (Gao and Alvo, 2005b). In the gene expression data of Drosophila melanogaster (Jin et al., 2001), for example, there are 24 cDNA microarrays, 6 for each combination of two genotypes (Oregon R and Samarkand) and two sexes. As each array used two different dyes, there were in total 48 separate labeling reactions. Focusing on the individual expression level of a gene and its relationship with genotypes and sexes, the objective of the study was to identify genes whose expression levels are affected by the interaction between the two factors. For such data, the assumption of normality for the error terms is not warranted and consequently, nonparametric procedures are needed. We shall consider a nonparametric test for interaction based on the row ranks and column ranks of the data.

6.4.1. Tests for Interaction: More Than One Observation per Cell

Suppose now we have data from a double array with I rows and J columns, with an equal number n > 1 of observations in each cell: $X_{ijk}$, i = 1, . . . , I, j = 1, . . . , J, k = 1, . . . , n. We first note that
$$\frac{1}{2(nJ+1)} \sum_{j'=1}^{J} \sum_{k'=1}^{n} \text{sgn}(x_{ijk} - x_{ij'k'}) = \frac{R_{ijk}}{nJ+1} - \frac{1}{2},$$
where $R_{ijk}$ is the rank of $X_{ijk}$ within the $i$th row. Similarly,
$$\frac{1}{2(nI+1)} \sum_{i'=1}^{I} \sum_{k'=1}^{n} \text{sgn}(x_{ijk} - x_{i'jk'}) = \frac{C_{ijk}}{nI+1} - \frac{1}{2},$$
where $C_{ijk}$ is the rank of $X_{ijk}$ within the $j$th column. Placing this problem within the context of a smooth model, the composite log likelihood function viewed from the perspective of rows only is proportional to
$$\begin{aligned}
l(\theta) &\sim \sum_{ij} \theta_{ij} \sum_{k=1}^{n} \sum_{j'=1}^{J} \sum_{k'=1}^{n} \text{sgn}(x_{ijk} - x_{ij'k'}) - n^2 J\, K(\theta) \\
&\sim \sum_{ij} \theta_{ij} \sum_{k=1}^{n} \left(R_{ijk} - \frac{nJ+1}{2}\right) - n^2 J\, K(\theta) \\
&\sim \sum_{ij} \theta_{ij} \sum_{k=1}^{n} \frac{R_{ijk}}{nJ+1} - n K(\theta).
\end{aligned}$$
Similarly, the composite log likelihood function viewed from the perspective of columns only is proportional to
$$l(\theta) \sim \sum_{ij} \theta_{ij} \sum_{k=1}^{n} \frac{C_{ijk}}{nI+1} - n K(\theta).$$
Let
$$x_{ijk} = \frac{R_{ijk}}{nJ+1} + \frac{C_{ijk}}{nI+1}.$$
Consequently, the composite log likelihood from the perspective of both rows and columns is proportional to the sum
$$\sum_{ij} \theta_{ij} \sum_{k=1}^{n} \left(\frac{R_{ijk}}{nJ+1} + \frac{C_{ijk}}{nI+1}\right) - K(\theta) \sim \sum_{ij} \theta_{ij} \sum_{k=1}^{n} x_{ijk} - n K(\theta) \sim n \sum_{ij} \theta_{ij} \bar{x}_{ij.} - K(\theta).$$

Using dots to denote an average over that index, set
$$\theta_{ij} = \theta_{..} + \alpha_i + \beta_j + \gamma_{ij},$$
where
$$\alpha_i = \theta_{i.} - \theta_{..}, \qquad \beta_j = \theta_{.j} - \theta_{..}, \qquad \theta_{..} = \frac{1}{IJ}\sum_{ij} \theta_{ij},$$
and
$$\gamma_{ij} = \theta_{ij} - \theta_{i.} - \theta_{.j} + \theta_{..}.$$
Here, $\alpha_i$, $\beta_j$, $\gamma_{ij}$ represent respectively the row, column, and interaction effects. It can be seen that
$$\sum_{ij} \theta_{ij} \bar{x}_{ij.} = IJ\,\theta_{..}\bar{x}_{...} + J\sum_{i} \alpha_i (\bar{x}_{i..} - \bar{x}_{...}) + I\sum_{j} \beta_j (\bar{x}_{.j.} - \bar{x}_{...}) + \sum_{ij} \gamma_{ij} (\bar{x}_{ij.} - \bar{x}_{i..} - \bar{x}_{.j.} + \bar{x}_{...}).$$
We wish to test the null hypothesis of no interaction, that is,
$$H_0: \gamma_{ij} = 0, \quad \text{for all } i, j.$$

The score vector U has $ij$ component equal to
$$\frac{\partial l(\theta)}{\partial \gamma_{ij}} = \left[\bar{x}_{ij.} - \bar{x}_{i..} - \bar{x}_{.j.} + \bar{x}_{...}\right] - \frac{\partial K(\theta)}{\partial \gamma_{ij}}.$$
Consequently, the test statistic to test the hypothesis of no interaction is given by
$$n^{-1} U' \Sigma^{-} U, \qquad (6.10)$$
where $\Sigma^{-}$ represents the generalized inverse of the variance-covariance matrix of U. The asymptotic distribution of this statistic under the null hypothesis was shown to be $\chi^2_{(I-1)(J-1)}$ (Gao and Alvo (2005a)). As well, the distribution under contiguous alternatives was shown to be $\chi^2_{(I-1)(J-1)}(\delta)$, where δ is the noncentrality parameter. The demonstration of these results makes use of results in Hájek (1968).
Example 6.4. Consider the gene expression data of Drosophila melanogaster of Jin
et al. (2001). The gene fs(1) k10 is known to be expressed in reproductive systems
and its expression level was reportedly affected by the gender and genotype interaction.
The row-column statistic was applied to this data to account for the genotype, the
gender, and the genotype-gender interaction. It was found that the interaction effect
was statistically significant with a p-value equal to 0.004. The parametric F statistic
and the aligned rank transform using the residuals yielded similar results. In order to
illustrate the robustness of the nonparametric procedures, the analyses were redone with
the first observation changed to an arbitrarily large number. The performance of the F
statistic was severely affected and yielded a nonsignificant result. On the other hand,
the nonparametric procedures were unaffected.
Next, we recall that in an example of a 3 × 4 factorial design considered by Box and
Cox (1964) it was claimed that only after application of a nonlinear transformation can
the error term be stabilized and the data made suitable for standard statistical analysis.
We applied the row-column procedure to the un-transformed data and obtained a p-value
of 0.44. Thus the hypothesis of no interaction was not rejected, a finding that concurs
with Box and Cox. The aligned test on the other hand yielded a p-value of 0.02 which
indicates the presence of interaction. However, for the transformed data, the aligned
test with a p-value of 0.45 did not reject the null hypothesis.

Chapter Notes
1. Gao and Alvo (2005b) provide a brief historical look at the analysis of unbalanced
two-way layout with interaction effects. Using the notion of a weighted rank, they
present tests for both main effects and interaction effects. In addition, there is
a discussion of the asymptotic relative efficiency of the proposed tests relative to
the parametric F test. Various simulations further exemplify the power of the
proposed tests. In a specific application, it is shown that the test statistic is the
most robust in the presence of extreme outliers compared to other procedures.
2. Gao et al. (2008) also consider nonparametric multiple comparison procedures for unbalanced one-way factorial designs, whereas Gao and Alvo (2008) treat nonparametric multiple comparison procedures for unbalanced two-way layouts.
3. Alvo and Pan (1997) considered the two-way layout with an ordered alternative
using the tools of the unified theory. It was seen that the Spearman’s distance
induced the Page statistic (Page, 1963). The statistic induced by Hamming’s
distance was new (see also Schach (1979)). Alvo and Cabilio (1995) considered
the two-way layout with ordered alternatives when the data within blocks may be
incomplete and they obtained generalizations of the Page and Jonckheere statistics.
Cabilio and Peng (2008) considered a multiple comparison procedure for ordered
alternatives when the data are incomplete which maintains the experimentwise
error rate at a preassigned level.

6.5. Exercises
Exercise 6.1. Consider the randomized block experiment given by the model
$$X_{ij} = b_i + \tau_j + e_{ij}, \quad i = 1, \ldots, n; \ j = 1, \ldots, t,$$
where $b_i$ is a block effect, $\tau_j$ is a treatment effect, and $\{e_{ij}\}$ are independent identically distributed error terms having a continuous distribution. Use the unified theory of hypothesis testing to obtain the statistic that corresponds to the Spearman measure of distance in order to test
$$H_0: \tau_1 = \tau_2 = \ldots = \tau_t$$
against the ordered alternative
$$H_1: \tau_1 \le \tau_2 \le \ldots \le \tau_t$$
with at least one strict inequality. (See Alvo and Cabilio (1995).)

Exercise 6.2. Suppose that one observes at times t1 < t2 < . . . < tk a random sample
of ni binary variables {yij } taking values 1 or 0 with unknown probabilities θi , 1 − θi
respectively. Use the unified theory of hypothesis testing to obtain the statistic that
corresponds to the Spearman measure of distance in order to test the null hypothesis of
homogeneity
H 0 : θ1 = θ2 = . . . = θ k
against the ordered alternative

H 1 : θ1 ≤ θ2 ≤ . . . ≤ θ k

with at least one strict inequality. (See Alvo and Berthelot (2012).)
Exercise 6.3. Using the following coded data on drug toxicity during 5 hours, test for
an umbrella alternative when the peak toxicity is assumed to be during the 12:00–13:00
time period.

10:00–11:00 11:00–12:00 12:00–13:00 13:00–14:00 14:00–15:00


3.8 4.6 6.3 5.1 1.5
5.5 7.0 8.2 6.6 3.0
4.9 5.9 7.4 7.7 4.1

Exercise 6.4. We are interested in comparing 3 energy drinks in terms of endurance.


The response is time to exhaustion on a treadmill. Test for differences in the mean
between the three drinks.

Drink 1 Drink 2 Drink 3


42 48 62
36 34 48
54 56 75
44 46 52
28 32 44
45 50 65

7. Tests for Trend and Association
In this chapter, we consider additional applications of the smooth model paradigm de-
scribed earlier in Chapter 4. We begin by considering tests for trend. We then proceed
with the study of the one-sample test for a randomized block design. We obtain a dif-
ferent proof of the asymptotic distribution of Friedman’s statistic based on Alvo (2016)
who developed a likelihood function approach for the analysis of ranking data. Further,
we derive a test statistic for the two-sample problem as well as for problems involving
various two-way experimental designs. We exploit the parametric paradigm further by
introducing the use of penalized likelihood in order to gain further insight into the data.
Specifically, if judges provide rankings of t objects, penalized likelihood enables us to
focus on those objects which exhibit the greatest differences.

7.1. Tests for Trend


Suppose that a series of observations of a random variable (concentration, unit well yield,
biologic diversity, etc.) have been collected over some period of time. We would like to
determine if their values exhibit either an increasing or a decreasing trend. In statistical
terms this translates into a determination of whether the probability distribution from
which they arise has changed over time. We would also like to describe the amount or
rate of that change, in terms of changes in some central value of the distribution such as
a mean or a median. The null hypothesis is that there is no trend. However, any given
test brings with it a precise mathematical definition of what is meant by “no trend,”
including a set of background assumptions usually related to a type of distribution and
to the serial correlation.
It is often of interest to test for a monotone trend in a series of data. For example,
one may be interested in testing whether or not the pH in a river is decreasing or
increasing over time. In such a case, the underlying distribution of the data is seldom
known especially for small samples. We may develop, without loss in generality, a
nonparametric test for a monotone increasing trend.
Let X1 , . . . , Xn be a random sample from some continuous distribution and consider
the model in (5.7):

π (h (x, y) ; θ) = exp [θ I(x < y) − K(θ)] g(h(x, y))


where g(h(x, y)) = 1/2 for x ≠ y, I(A) = 1 if the event A occurs and 0 otherwise, with

$$h(x, y) = I(x < y), \qquad e^{K(\theta)} = \frac{e^{\theta} + 1}{2}.$$
The kernel h (x, y) may be used to compare any two values along the sequence of the
observations and is a measure of the slope. The case when θ = 0 corresponds to the
situation when there is no trend. We may construct the composite likelihood function

$$L(\theta_i) = \prod_{j \neq i} \pi\big(h(x_i, x_j); \theta_i\big) = \exp\Big[\theta_i \sum_{j \neq i} I(x_i < x_j) - (n-1)K(\theta_i)\Big] \prod_{j \neq i} g\big(h(x_i, x_j)\big).$$

The choice of kernel function is motivated by the fact that in testing for an increasing
trend, we should focus on observations to the right of the present observation. It is seen
that 
$$\sum_{j \neq i} I(X_j < X_i) = R_i - 1,$$

where Ri is the rank of Xi among X1 , . . . , Xn . Hence, the log of the composite likelihood
is proportional to

"
n
(θ) = log Li (θi )
i=1

n 
n 
∼ θi R i − θi − (n − 1) K (θi ) .
i=1 i=1

Suppose now that we define

$$\theta_i = \beta\left(i - \frac{n+1}{2}\right).$$

It follows that

$$\ell(\theta) \sim \beta \sum_{i=1}^{n} \left(i - \frac{n+1}{2}\right) R_i - (n-1) \sum_{i=1}^{n} K(\theta_i)$$
so that the test of no trend
H0 : θi = 0 for all i
is now a test that β = 0. Differentiating with respect to β and setting the derivative equal to 0 leads
equivalently to the Rao score test statistic
$$S_n = \sum_{i=1}^{n} \left(i - \frac{n+1}{2}\right) R_i \qquad (7.1)$$


which rejects for large absolute values. It should be noted that with this same approach
we can also test for quadratic trends by choosing
$$\theta_i = \beta\left(i - \frac{n+1}{2}\right)^2.$$

The statistic in (7.1) is a well-known test of trend. Since it is a linear rank statistic, its
asymptotic distribution can be easily obtained from Theorem 3.2. In fact, we have that
for large n

$$\frac{1}{\sigma} \sum_{i=1}^{n} \left(i - \frac{n+1}{2}\right) \frac{R_i}{n+1} \;\xrightarrow{L}\; N(0, 1)$$

where

$$\sigma^2 = \frac{1}{12} \sum_{i=1}^{n} \left(i - \frac{n+1}{2}\right)^2 = \frac{n(n^2-1)}{144}.$$
Yu et al. (2002) obtained a generalization of the trend statistic in the presence of ties.

Example 7.1. In Appendix A.7, precipitation data for Saint John, New Brunswick,
Canada was analyzed for the period 1894–1991 using the Spearman statistic (Alvo and
Cabilio (1994)). The Z-score for St John was calculated to be 2.08 indicating there is
an increasing trend.

7.2. Problems of Concordance


We recall Section 4.3 and consider the one-sample ranking problem whereby a group of
judges are each asked to rank a set of t objects in accordance with some criterion. Let
P = {ν j , j = 1, . . . , t!} be the space of all t! permutations of the integers 1, 2, . . . , t and
let the probability mass distribution defined on P be given by

p = (p1 , . . . , pt! ) ,

where pj = Pr (ν j ). Conceptually, each judge selects a ranking ν in accordance with


the probability mass distribution p. We are interested in testing the null hypothesis that
each of the rankings is selected with equal probability, that is

$$H_0: p = p_0 \quad vs \quad H_1: p \neq p_0, \qquad (7.2)$$

where $p_0 = \frac{1}{t!}\,1$.
Define a k-dimensional vector score function X (ν) on the space P and let its smooth
probability mass function be given as

$$\pi(x_j; \theta) = \frac{1}{t!} \exp\left(\theta' x_j - K(\theta)\right), \quad j = 1, \ldots, t! \qquad (7.3)$$


where K(θ) is a normalizing constant and θ is a k-dimensional vector. Since

$$\sum_{j=1}^{t!} \pi(x_j; \theta) = 1$$

it can be seen that K(0) = 0 and hence the hypotheses in (7.2) are equivalent to

$$H_0: \theta = 0 \quad vs \quad H_1: \theta \neq 0. \qquad (7.4)$$

It follows that the log likelihood function is proportional to

$$l(\theta) \sim n\left[\theta' \hat{\eta} - K(\theta)\right],$$

where

$$\hat{\eta} = \sum_{j=1}^{t!} x_j\, \hat{p}_{nj}, \qquad \hat{p}_{nj} = \frac{n_j}{n}$$

and nj represents the number of observed occurrences of the ranking ν j . The Rao score
statistic evaluated at θ = 0 is
$$U(\theta; X) = \frac{\partial}{\partial \theta}\, n\left[\theta' \hat{\eta} - K(\theta)\right]\Big|_{\theta=0} = n\left[\hat{\eta} - \frac{\partial}{\partial \theta} K(0)\right]$$

whereas the information matrix is

$$I(0) = -n\, \frac{\partial^2}{\partial \theta\, \partial \theta'}\left[\theta' \hat{\eta} - K(\theta)\right]\Big|_{\theta=0} = n\, \frac{\partial^2}{\partial \theta\, \partial \theta'} K(0).$$

The test then rejects the null hypothesis whenever

$$n^2 \left(\hat{\eta} - \frac{\partial}{\partial \theta} K(0)\right)' I^{-1}(0) \left(\hat{\eta} - \frac{\partial}{\partial \theta} K(0)\right) > \chi^2_f(\alpha),$$

where χ2f (α) is the upper 100 (1 − α) % critical value of a chi-square distribution with
f = rank(I (θ)) degrees of freedom. We note that the test just obtained is the locally
most powerful test of H0 . In the next section, we specialize this test statistic and consider
the score functions of Spearman and Kendall.


7.2.1. Application Using Spearman Scores


In this section, we consider the Spearman score function which is defined to be the
t-dimensional random vector of adjusted ranks

$$x_j = \left(\nu_j(1) - \frac{t+1}{2}, \ldots, \nu_j(t) - \frac{t+1}{2}\right)', \quad j = 1, \ldots, t!$$

Let T_S = (x_j) be the t × t! matrix of possible values of X. In the next theorem, we
consider properties of the Spearman scores.

Theorem 7.1. Under the null hypothesis for the Spearman scores,

(a) the covariance function of X is given by

$$\mathrm{Cov}(X) = \frac{1}{t!}\, T_S T_S' = \frac{t+1}{12}\left[t I - J_t\right]. \qquad (7.5)$$

(b) a generalized inverse of the covariance function is given by

$$(T_S T_S')^{-} = \frac{12}{t(t+1)}\left[I + J_t\right]. \qquad (7.6)$$

Proof. We refer the reader to (Alvo and Yu (2014), Chapter 4).

Next, we demonstrate that the Rao score statistic is the well-known Friedman test
(Friedman, 1937).

Theorem 7.2. Under the null hypothesis, the Rao score statistic is asymptotically χ²_{t−1}
and is given by

$$W = \frac{12n}{t(t+1)} \sum_{i=1}^{t} \left(\bar{R}_i - \frac{t+1}{2}\right)^2, \qquad (7.7)$$

where R̄i is the average of the ranks assigned to the ith object.

Proof. It can be seen that

$$\hat{\eta} = \sum_{j=1}^{t!} x_j\, \hat{p}_{nj} = T_S\, \hat{p}_n,$$

where

$$\hat{p}_n = (\hat{p}_{nj})$$

is the t! × 1 vector of relative frequencies. Since under the null hypothesis, θ = 0,

$$E_\theta[X] = 0,$$

it follows that U(θ; X) = n T_S p̂_n and

$$I(0) = \frac{n}{t!}\, T_S T_S' = n\left(\frac{t+1}{12}\right)\left[t I - J_t\right].$$

Consequently, the test statistic becomes

$$
\begin{aligned}
n^2 \left[T_S \hat{p}_n\right]' I^{-1}(0) \left[T_S \hat{p}_n\right]
&= n^2 \left(T_S \hat{p}_n\right)' \frac{12}{n\, t(t+1)}\left[I + J_t\right]\left(T_S \hat{p}_n\right) \\
&= \frac{12n}{t(t+1)} \left(T_S \hat{p}_n\right)'\left(T_S \hat{p}_n\right) \\
&= \frac{12n}{t(t+1)} \sum_{i=1}^{t} \left(\bar{R}_i - \frac{t+1}{2}\right)^2. \qquad (7.8)
\end{aligned}
$$

For the last equality, we used the fact that

$$T_S\, \hat{p}_n = \bar{R} - \left(\frac{t+1}{2}\right) 1,$$

where R̄ = (R̄_1, . . . , R̄_t)' is the vector of the average ranks assigned to the objects.
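
A minimal sketch (ours, not from the text) of the Friedman statistic (7.7), computed directly from an n × t matrix of complete rankings, one ranking per judge; the data below are hypothetical.

import numpy as np
from scipy.stats import chi2

def friedman_W(rankings):
    rankings = np.asarray(rankings, dtype=float)
    n, t = rankings.shape
    r_bar = rankings.mean(axis=0)                       # average rank per object
    W = 12 * n / (t * (t + 1)) * np.sum((r_bar - (t + 1) / 2) ** 2)
    return W, chi2.sf(W, df=t - 1)

R = [[1, 2, 3, 4],
     [2, 1, 3, 4],
     [1, 3, 2, 4]]          # three judges ranking t = 4 objects
W, p = friedman_W(R)
print(f"W = {W:.3f}, p = {p:.3f}")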

In the next section, we consider the Kendall scores as a second application.

7.2.2. Application Using Kendall Scores


Suppose now that the random vector X takes values (tK (ν))q where the qth element is
given by
$$(t_K(\nu))_q = \mathrm{sgn}\left[\nu(j) - \nu(i)\right]$$

for $q = (i-1)t - \binom{i}{2} + (j-i)$, 1 ≤ i < j ≤ t. This is the Kendall score function
whose $\binom{t}{2} \times t!$ matrix of possible values is given by

$$T_K = \left(t_K(\nu_1), \ldots, t_K(\nu_{t!})\right).$$

Theorem 7.3. Under the null hypothesis for the Kendall scores,


(a) the covariance function of X is given by

$$\mathrm{Cov}(X) = \frac{1}{t!}\, T_K T_K'$$

whose entries $A(s, s', t, t') = \frac{1}{t!} \sum_\nu \mathrm{sgn}\left(\nu(s) - \nu(t)\right) \mathrm{sgn}\left(\nu(s') - \nu(t')\right)$ are given by

$$A(s, s', t, t') = \begin{cases} 0 & s \neq s',\; t \neq t' \\ 1 & s = s',\; t = t' \\ \tfrac{1}{3} & s = s',\; t \neq t' \\ -\tfrac{1}{3} & s = t',\; s' = t. \end{cases}$$

Moreover, the eigenvalues of Cov(X) are 1/3 and (t+1)/3 with multiplicities $\binom{t-1}{2}$ and (t−1),
respectively.

(b) the inverse matrix has entries of the form

$$B(s, s', t, t') = \frac{3}{t+1} \begin{cases} 0 & s \neq s',\; t \neq t' \\ t-1 & s = s',\; t = t' \\ -1 & s = s',\; t \neq t' \\ 1 & s = t',\; s' = t. \end{cases}$$

Proof. Part 1 follows from Lemma 4.1 in (Alvo and Yu (2014), p. 58). Part 2 follows by
direct calculation.

As an example, consider the case t = 3. Then,


$$X(\nu) = \begin{pmatrix} \mathrm{sgn}(\nu(2) - \nu(1)) \\ \mathrm{sgn}(\nu(3) - \nu(1)) \\ \mathrm{sgn}(\nu(3) - \nu(2)) \end{pmatrix}.$$

It can be seen that

$$T_K T_K' = \begin{pmatrix} 6 & 2 & -2 \\ 2 & 6 & 2 \\ -2 & 2 & 6 \end{pmatrix}
\quad \text{and} \quad
(T_K T_K')^{-1} = \begin{pmatrix} \tfrac{1}{4} & -\tfrac{1}{8} & \tfrac{1}{8} \\ -\tfrac{1}{8} & \tfrac{1}{4} & -\tfrac{1}{8} \\ \tfrac{1}{8} & -\tfrac{1}{8} & \tfrac{1}{4} \end{pmatrix}.$$

The inverse matrix can be readily computed even for values of t as large as 10.
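
The t = 3 computation can be verified with a short sketch (ours, not from the text) that enumerates the 3! rankings, builds T_K, and inverts the Gram matrix:

import numpy as np
from itertools import permutations

t = 3
perms = list(permutations(range(1, t + 1)))   # the t! rankings in lexicographic order
pairs = [(i, j) for i in range(t) for j in range(i + 1, t)]

# Column for ranking nu: sgn(nu(j) - nu(i)) over the pairs i < j.
T_K = np.array([[np.sign(nu[j] - nu[i]) for nu in perms] for (i, j) in pairs])

G = T_K @ T_K.T
print(G)                  # [[6, 2, -2], [2, 6, 2], [-2, 2, 6]]
print(np.linalg.inv(G))   # diagonal 1/4, off-diagonals +/- 1/8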


Theorem 7.4. Under the null hypothesis, the Rao score statistic for the Kendall scores
is asymptotically $\chi^2_{\binom{t}{2}}$ as n → ∞ and is given by

$$n\left(T_K \hat{p}_n\right)' \left(T_K T_K'\right)^{-1} \left(T_K \hat{p}_n\right). \qquad (7.9)$$

Proof. The proof follows as for the Spearman statistic.


An alternate form of the asymptotic distribution is given in Alvo et al. (1982):

$$n\left(T_K \hat{p}_n\right)'\left(T_K \hat{p}_n\right) \;\xrightarrow{L}\; \frac{t+1}{3}\, \chi^2_{t-1} + \frac{1}{3}\, \chi^2_{\binom{t-1}{2}}$$

where the left-hand side can be calculated as

$$\sum \frac{(2\#_i - n)^2}{n}.$$

The summation is over all $\binom{t}{2}$ pairs of objects and #_i is the number of judges whose
ranking of the pair i of objects agrees with the ordering of the same pair in a criterion
ranking such as the natural ranking. The Kendall statistic (7.9) has the simpler asymptotic
distribution but its form is somewhat more complicated to compute; for the alternate form,
the reverse is true.
In this section, we have seen that we can derive some well-known statistics through
the parametric paradigm and that these are locally most powerful. We proceed next to
show that a similar result can be obtained using Hamming score functions.

7.2.3. Application Using Hamming Scores


Suppose now that the random vector X takes values (t_H(ν))_q where the qth element is
given by

$$(t_H(\nu))_q = I\left[\nu(i) = j\right] - \frac{1}{t}$$

for q = (i − 1)t + j, 1 ≤ i, j ≤ t. This is the Hamming score function
whose t² × t! matrix of possible values is

$$T_H = \left(t_H(\nu_1), \ldots, t_H(\nu_{t!})\right).$$

Theorem 7.5. Under the null hypothesis for the Hamming scores,
(a) the covariance function of X is given by
   
1 J J
Γ = Cov (X) = I− ⊗ I−
t−1 t t

and Γ has a distinct eigenvalue 1


t−1
with multiplicity (t − 1)2 .


(b) The asymptotic distribution of the test statistic is

$$(t-1)\, n\left(T_H \hat{p}_n\right)'\left(T_H \hat{p}_n\right) \;\xrightarrow{L}\; \chi^2_{(t-1)^2}.$$

Moreover, the statistic can be calculated from

$$\left(T_H \hat{p}_n\right)'\left(T_H \hat{p}_n\right) = \frac{1}{n^2}\left[\sum_l \sum_i D_i^2(l) - n^2\right]$$

where D_i(l) is the number of rankers who assign rank l to object i.
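
A minimal sketch (ours, not from the text) computing this statistic in the counting form above; the rankings are hypothetical.

import numpy as np
from scipy.stats import chi2

def hamming_statistic(rankings):
    rankings = np.asarray(rankings, dtype=int)
    n, t = rankings.shape
    D = np.zeros((t, t))                 # D[i, l]: judges assigning rank l+1 to object i+1
    for row in rankings:
        for obj, rank in enumerate(row):
            D[obj, rank - 1] += 1
    stat = (t - 1) / n * (np.sum(D ** 2) - n ** 2)
    return stat, chi2.sf(stat, df=(t - 1) ** 2)

R = [[1, 2, 3], [1, 3, 2], [2, 1, 3], [1, 2, 3]]
stat, p = hamming_statistic(R)
print(f"statistic = {stat:.3f}, p = {p:.3f}")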

This test was first introduced by Anderson (1959) and rediscovered by Kannemann
(1976). Schach (1979) obtained the asymptotic distribution of the statistic based on
Hamming distance under the null hypothesis and under contiguous alternatives making
use of Le Cam’s third lemma. Alvo and Cabilio (1998) extended the statistic to include
various block designs.

7.3. The Two-Sample Ranking Problem


We may now consider the two-sample ranking problem. For simplicity, we shall make
use of the Spearman scores throughout. Let X 1 , X 2 be two independent random vectors
whose distributions, as in the one-sample case, are expressed for simplicity as

$$\pi(x_j; \theta_l) = \exp\left\{\theta_l' x_j - K(\theta_l)\right\} p_l(j), \quad j = 1, \ldots, t!, \; l = 1, 2,$$

where θ_l = (θ_{l1}, . . . , θ_{lt})' represents the vector of parameters for population l. We are
interested in testing

$$H_0: \theta_1 = \theta_2 \quad vs \quad H_1: \theta_1 \neq \theta_2.$$
The probability distribution {p_l(j)} represents an unspecified null situation. Define

$$\hat{p}_l = \left(\frac{n_{l1}}{n_l}, \ldots, \frac{n_{lt!}}{n_l}\right)',$$

where n_{lj} represents the number of occurrences of the ranking ν_j in sample l.



Also, for l = 1, 2, set $\sum_j n_{lj} \equiv n_l$, γ = θ_1 − θ_2 and

$$\theta_l = m + b_l \gamma,$$

where

$$m = \frac{n_1 \theta_1 + n_2 \theta_2}{n_1 + n_2}, \qquad b_1 = \frac{n_2}{n_1 + n_2}, \qquad b_2 = -\frac{n_1}{n_1 + n_2}.$$


Let Σ_l be the covariance matrix of X_l under the null hypothesis defined as

$$\Sigma_l = \Pi_l - p_l\, p_l',$$

where Π_l = diag(p_l(1), . . . , p_l(t!)) and p_l = (p_l(1), . . . , p_l(t!))'. The logarithm of the
likelihood L as a function of (m, γ) is proportional to

$$\log L(m, \gamma) \sim \sum_{l=1}^{2} \sum_{j=1}^{t!} n_{lj}\left[(m + b_l \gamma)' x_j - K(\theta_l)\right].$$

Theorem 7.6. Consider the two-sample ranking problem whereby we wish to test

$$H_0: \theta_1 = \theta_2 \quad vs \quad H_1: \theta_1 \neq \theta_2.$$

The Rao score test statistic is given by

$$n\left(T_S \hat{p}_1 - T_S \hat{p}_2\right)' \hat{D} \left(T_S \hat{p}_1 - T_S \hat{p}_2\right). \qquad (7.10)$$

It has asymptotically a χ²_f distribution whenever n_l/n → λ_l > 0 as n → ∞, where n = n_1 + n_2.
Here D̂ is the Moore-Penrose inverse of $T_S \hat{\Sigma}\, T_S'$, Σ̂ is a consistent estimator of
$\Sigma = \frac{\Sigma_1}{\lambda_1} + \frac{\Sigma_2}{\lambda_2}$, and f is the rank of D̂.

Proof. The Rao score vector evaluated under the null hypothesis is given by

$$\frac{\partial \log L(m, \gamma)}{\partial \gamma} = \left(\frac{n_1 n_2}{n_1 + n_2}\right)\left(T_S \hat{p}_1 - T_S \hat{p}_2\right)$$

and consequently, the Rao score test statistic becomes

$$n\left(T_S \hat{p}_1 - T_S \hat{p}_2\right)' \hat{D} \left(T_S \hat{p}_1 - T_S \hat{p}_2\right).$$

The result follows from the general theory in Section 7.1.

The result of this section was first obtained in Feigin and Alvo (1986) using notions
of diversity. It is derived presently through the parametric paradigm. See Chapter 4.2
of Feigin and Alvo (1986) for a discussion on the efficient calculation of the test statistic.
The parametric paradigm can also be used to deal with the two-sample mixture
problem using a distribution expressed as

π (X 1 , X 2 ; θ 1 , θ 2 ) = λπ (X 1 ; θ 1 ) + (1 − λ) π (X 2 ; θ 2 ) , 0 < λ < 1.

In that case, the use of the EM algorithm can provide estimates of the parameters (see
Casella and George (1992)).


7.4. The Use of Penalized Likelihood in Tests of Concordance: One and Two Group Cases
In the previous sections, it was possible to derive well-known test statistics for the one-
and two-sample ranking problems through the parametric paradigm. In this section,
we make use of the parametric paradigm to obtain new results for the ranking prob-
lems. Specifically, we consider a negative penalized likelihood function defined to be the
negative likelihood function subject to a constraint on the parameters which is then min-
imized with respect to the parameter. This approach yields further insight into ranking
problems.
For the one-sample ranking problem, let
$$\Lambda(\theta, c) = -\theta' \sum_{j=1}^{t!} n_j x_j + n K(\theta) + \lambda\left(\sum_{i=1}^{t} \theta_i^2 - c\right) \qquad (7.11)$$

represent the penalizing function for some prescribed values of the constant c. We shall
assume for simplicity that ‖x_j‖ = 1. When t is large (say t ≥ 10), the computation of
the exact value of the normalizing constant K(θ) involves a summation over t! terms.
McCullagh (1993) noted the resemblance of (7.3) to the continuous von Mises-Fisher
density

$$f(x; \theta) = \frac{\|\theta\|^{(t-3)/2}}{2^{(t-3)/2}\, t!\, I_{(t-3)/2}(\|\theta\|)\, \Gamma\left(\frac{t-1}{2}\right)}\, \exp\left(\theta' x\right),$$

where ‖θ‖ is the norm of θ, x is on the unit sphere, and I_υ(z) is the modified Bessel
function of the first kind given by

$$I_\upsilon(z) = \sum_{k=0}^{\infty} \frac{1}{\Gamma(k+1)\, \Gamma(\upsilon + k + 1)} \left(\frac{z}{2}\right)^{2k+\upsilon}.$$

This seems to suggest the approximation of the constant K(θ) by

$$\exp(-K(\theta)) \approx \frac{1}{t!} \cdot \frac{\|\theta\|^{(t-3)/2}}{2^{(t-3)/2}\, I_{(t-3)/2}(\|\theta\|)\, \Gamma\left(\frac{t-1}{2}\right)}.$$

The accuracy of this approximation is very good as discussed further in Chapter 11 in


connection with Bayesian models for ranking data.
We proceed to find the maximum penalized likelihood estimation for θ using algo-
rithms implemented in MATLAB that converge very fast. Following the estimation of
θ, we apply the basic bootstrap method in order to determine the distribution of θ. We
sample n rankings with replacement from the observed data. Then we find the maximum


Table 7.1.: Combined data on leisure preferences


Rankings (123) (132) (213) (231) (312) (321)
Frequencies 1 1 1 5 7 12

likelihood estimate of θ from each bootstrap sample. Repeating this procedure 10,000
times leads to the bootstrap distribution for θ. In this way, we can draw useful inference
from the distribution θ and in particular construct two-sided confidence intervals for its
components. We applied this to a data set with t = 3. Define the probabilities of the
rankings

p1 = P {123} , p2 = P {132} , p3 = P {213} , p4 = P {231} , p5 = P {312} , p6 = P {321}

and consider the interpretation of the θ vector. Since

$$\theta'\, T_S\, p = \begin{pmatrix} \theta_1 & \theta_2 & \theta_3 \end{pmatrix} \begin{bmatrix} p_5 + p_6 - p_1 - p_2 \\ p_2 + p_4 - p_3 - p_5 \\ p_1 + p_3 - p_4 - p_6 \end{bmatrix},$$

we note that θ1 weights the difference between

p5 + p6 = Pr(giving rank 3 to item 1)

and
p1 + p2 = Pr(giving rank 1 to item 1)
which compares the probabilities of assigning the lowest and highest rank to object 1.
The other components make similar comparisons for the other objects.
Example 7.2. Sutton Data (One-Sample Case)
C. Sutton considered in her 1976 thesis the leisure preferences and attitudes on retire-
ment of the elderly for 14 white and 13 black females in the age group 70–79 years.
Each individual was asked: with which sex do you wish to spend your leisure time?
Each female was asked to rank the three responses: male(s), female(s), or both, as-
signing rank 1 for the most desired and 3 for the least desired. The first object in the
ranking corresponds to “male,” the second to “female,” and the third to “both.” To illus-
trate the approach in the one-sample case, we combined the data from the two groups
as in Table 7.1.
We applied the method of penalized likelihood in this situation and the results are
shown in Table 7.2. To better illustrate our result, we rearrange our result (unconstrained
θ, c = 1) and data as Table 7.3. It can be seen that θ1 is the largest coefficient and
object 1 (Male) shows the greatest difference between the number of judges choosing
rank 1 or rank 3 which implies that the judges dislike spending leisure time with males


Table 7.2.: Penalized likelihood for the combined data


c θ1 θ2 θ3 Λ(θ, c)
0.5 0.53 -0.06 -0.47 50.00
1 0.75 -0.09 -0.66 50.36
2 1.06 -0.12 -0.93 54.62

Table 7.3.: The combined data reexpressed

Object   Number of judges   Action          Difference   c = 1
Male     2                  assign rank 1   −17          θ1 = 0.75
         19                 assign rank 3
Female   8                  assign rank 1   2            θ2 = −0.09
         6                  assign rank 3
Both     17                 assign rank 1   15           θ3 = −0.66
         2                  assign rank 3

the most. For object 3 (Both), the large negative value of θ3 implies the judges prefer
to spend leisure time with both sexes the most. θ2 is close to zero and we deduce that the
judges show no strong preference regarding Female. This is consistent with the hypothesis that
θ close to zero means randomness. To conclude, the results also show that θi weights
the difference in probability between assigning the lowest and the top rank to object i.
A negative value of θi means the judges prefer object i more whereas a positive θi means
the judges are more likely to assign a lower rank to object i.
We plot the bootstrap distribution of θ in Figure 7.1. For H0 : θi = 0, we see that θ1
and θ3 are significantly different from 0 whereas θ2 is not. We also see that the bootstrap
distributions are not entirely bell shaped leading us to conclude that a traditional t-test
method may not be appropriate in this case.
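
A rough sketch of this bootstrap (ours, in Python rather than the MATLAB used for the book) for the combined Sutton data: for t = 3 the constant K(θ) can be computed exactly from the six rankings, so no Bessel approximation is needed. The helper names are ours, the constant −λc is dropped as an additive constant, and only 200 replicates are drawn here (the text uses 10,000).

import numpy as np
from itertools import permutations
from scipy.optimize import minimize

t, lam = 3, 1.0
X = np.array([np.array(p) - (t + 1) / 2 for p in permutations(range(1, t + 1))],
             dtype=float)
X /= np.linalg.norm(X, axis=1, keepdims=True)          # ||x_j|| = 1

def neg_pen_loglik(theta, counts):
    K = np.log(np.mean(np.exp(X @ theta)))             # exact K(theta) under (7.3)
    return -(theta @ (counts @ X)) + counts.sum() * K + lam * theta @ theta

def fit(counts):
    return minimize(neg_pen_loglik, np.zeros(t), args=(counts,)).x

# Combined frequencies in the order (123),(132),(213),(231),(312),(321).
counts = np.array([1, 1, 1, 5, 7, 12])
rng = np.random.default_rng(0)
boot = [fit(np.bincount(rng.choice(6, size=counts.sum(), replace=True,
                                   p=counts / counts.sum()), minlength=6))
        for _ in range(200)]
lo, hi = np.percentile(boot, [5, 95], axis=0)          # two-sided limits for theta
print(lo, hi)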
For the Kendall score, we consider once again the Sutton data (t = 3) and apply a
penalized likelihood approach. The results are exhibited in Table 7.4.
We rearrange the Sutton data focusing on paired comparisons and the results (c = 1)
are displayed in Table 7.5. First, we note that all the θi s are negative. This is consistent

Table 7.4.: Penalized likelihood using the Kendall score function for the Sutton data

Paired comparison                 Choice of c
object i   object j            c = 0.5   c = 1    c = 2    c = 10    no constraint
1          2          θ1       −0.35     −0.49    −0.70    −1.56     −0.60
1          3          θ2       −0.56     −0.80    −1.13    −2.53     −0.97
2          3          θ3       −0.24     −0.34    −0.48    −1.08     −0.41
                      Λ(θ, c)  42.79     40.17    40.20    127.76    39.59


[Three-panel figure: histograms of the bootstrap distributions of θ1 and θ3 (panel a) and θ2 (panel b) for the Sutton data, each marked with the sample mean and the lower and upper 5% confidence limits.]
Figure 7.1.: The distribution of θ for Sutton data by bootstrap method

Table 7.5.: Paired comparison for the Sutton data and the estimation of θ

object i   object j   Number of judges   Paired comparison   θ
1          2          7                  more prefer 1       −0.49
                      20                 more prefer 2
1          3          3                  more prefer 1       −0.80
                      24                 more prefer 3
2          3          9                  more prefer 2       −0.34
                      18                 more prefer 3


Table 7.6.: Penalized likelihood results for the Sutton data


c γ1 γ2 γ3 m1 m2 m3 Λ(m, γ)
0.5 0.34 -0.57 0.24 0.59 -0.07 -0.52 46.88
1 0.48 -0.81 0.34 0.58 -0.06 -0.52 46.38
2 0.67 -1.15 0.48 0.57 -0.06 -0.51 46.46
10 1.50 -2.57 1.07 0.47 -0.04 -0.43 58.73

Table 7.7.: The Sutton data on leisure preferences


Rankings (123) (132) (213) (231) (312) (321)
Frequencies for white females 0 0 1 0 7 6
Frequencies for black females 1 1 0 5 0 6

with our interpretations. The judges show a strong preference for Both over Males and for
Females over Males; the preference for Both over Females is the weakest. We may conclude that the θi s
represent well the paired preferences among the judges. We applied penalized likelihood
in this situation and the results are shown in Table 7.6.

Example 7.3. Sutton Data (Two-Sample Case)


For the two-sample case, we consider a penalized likelihood approach to determine those
components of γ which most separate the populations. Hence, we consider minimizing
with respect to parameters m and γ the function:


$$\Lambda(m, \gamma) = -\sum_{l=1}^{2} \sum_{j=1}^{t!} (m + b_l \gamma)'\, n_{lj}\, x_{lj} + \sum_{l=1}^{2} n_l\, K(m + b_l \gamma) + \lambda\left(\sum_{i=1}^{t} \gamma_i^2 - c\right)$$

for some prescribed values of the constant c and λ. We continue to use the approximation
to the normalizing constant from the von Mises-Fisher distribution to approximate K(θ).

Here γi shows the difference between the two population’s preference on object i.
A negative γi means that population 1 shows more preference on object i compared to
population 2. A positive γi means that population 2 shows more preference on object i
compared to population 1. For γi close to zero, there is no difference between the two
populations on that object. As we shall see, this interpretation is consistent with the
results in the real data applications. From the definition of m, we know that m is the
common part of θ1 and θ2. More specifically, m is the weighted average of θ1 and θ2,
taking into account the sample sizes of the populations.
As an application consider the Sutton data (t = 3) found in Table 7.7. Rearranging
the results for c = 1 we have the original data in Table 7.8. First, it is seen that m
is just like the θ’s in the one-sample problem. For example, m3 is the smallest value
and the whole population prefers object “Both” best. m1 is the largest and the whole


Table 7.8.: The Sutton data and the estimation of m, γ

Object   #white females   #black females   Sum   Action        γ       m
Male     0                2                2     give rank 1   0.48    0.58
         13               6                19    give rank 3
Female   8                0                8     give rank 1   −0.81   −0.06
         0                6                6     give rank 3
Both     6                11               17    give rank 1   0.34    −0.52
         1                1                2     give rank 3

population mostly dislikes object “Male.” This is not surprising since we know that m is
the common part of θ1 and θ2. For the parameter γ, we note that white females prefer
to spend leisure time with Females (8 individuals assign rank 1) whereas black females
do not (6 individuals assign rank 3). We see that γ2 is negative and largest in absolute
value. There is a significant difference of opinion with respect to object 2, Female. For
objects “Male” and “Both,” black females show more preference than white females do. To conclude,
the results are consistent with the interpretation of m and γ.
We conclude that the use of the parametric paradigm provided more insight on the
objects being ranked. Further details for the penalized likelihood approach are found in
Alvo and Xu (2017).

7.5. Design Problems


We begin by defining the notion of compatibility introduced by Alvo and Cabilio (1999)
which will be useful for the study of incomplete block designs. We then recall some
asymptotic results found in Alvo and Yu (2014) and conclude by relating those to the
score statistic obtained by embedding the problem in a parametric framework. Illustra-
tions are given only for the Spearman distance.

7.5.1. Compatibility
Compatibility is a concept that was introduced in connection with incomplete rankings.
Specifically, suppose that μ = (μ(1), . . . , μ(t)) represents a complete ranking of t ob-
jects and that μ* = (μ*(o_1), . . . , μ*(o_k)) represents an incomplete ranking of a subset of k
of these objects where o_1, . . . , o_k represent the labels of the objects ranked. Alternatively,
for an incomplete ranking, we may retain the t-dimensional vector notation and indi-
cate by a “−” an unranked object. Hence, the incomplete ranking μ* = (2, −, 3, 4, 1)
indicates that among the t = 5 objects, only object “2” is not ranked. In this notation,
complete and incomplete rankings are of the same length t.


Definition 7.1. A complete ranking μ of t objects is said to be compatible with an
incomplete ranking μ* of a subset of k of these objects, 2 ≤ k ≤ t, if the relative ranking
of every pair of objects ranked in μ* coincides with their relative ranking in μ.

As an example, the incomplete ranking μ* = (2, −, 3, 4, 1) gives rise to a class of
order preserving complete compatible rankings. Denoting by C(μ*) the set of complete
rankings compatible with μ*, we have that

C(μ*) = {(2, 5, 3, 4, 1), (2, 4, 3, 5, 1), (2, 3, 4, 5, 1), (3, 2, 4, 5, 1), (3, 1, 4, 5, 2)}.
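
A brute-force sketch (ours, not from the text) that generates this compatibility class by keeping every complete ranking that preserves all pairwise orderings of the incomplete ranking; None marks an unranked object.

from itertools import permutations

def compatible(mu_star):
    t = len(mu_star)
    ranked = [i for i, r in enumerate(mu_star) if r is not None]
    out = []
    for nu in permutations(range(1, t + 1)):
        if all((mu_star[i] < mu_star[j]) == (nu[i] < nu[j])
               for a, i in enumerate(ranked) for j in ranked[a + 1:]):
            out.append(nu)
    return out

print(compatible((2, None, 3, 4, 1)))   # the five rankings listed above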

The total number of complete rankings of t objects compatible with an incomplete
ranking of a subset of k objects is given by t!/k!. This follows from the fact that there
are $\binom{t}{k}$ ways of choosing k integers for the ranked objects, one way in placing them to
preserve the order and then (t − k)! ways of rearranging the remaining integers. The
product is thus

$$a \equiv \binom{t}{k} (t-k)! = \frac{t!}{k!}.$$
In general, we may define a matrix of compatibility C so that each column contains
exactly a ones and each row contains exactly one 1.
For a given pattern of t−kh missing observations, each permutation of the kh objects
has its own distinct set of t!/kh ! compatible t-rankings, so that each column of C h
contains exactly t!/kh ! 1’s, and each row exactly one 1. For any incomplete kh −ranking
μ∗ , this definition can be shown to lead to an analogue of T given by

$$T_h^* = \frac{k_h!}{t!}\, T\, C_h \qquad (7.12)$$
whose columns are the score vectors of

μ∗ = (μ∗ (1) , μ∗ (2) , . . . , μ∗ (kh ))

as μ∗ ranges over each of the kh ! permutations (Alvo and Cabilio, 1991).

Example 7.4. Let t = 3, k_h = 2. The complete rankings associated with the rows are
in the order (123), (132), (213), (231), (312), (321). For the incomplete rankings (12_),
(21_) indexing the columns, the associated compatibility matrix C_h is

$$C_h = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix}.$$


In the Spearman case, for the incomplete rankings (12_), (21_),

$$T_h^* = \frac{1}{3} \begin{bmatrix} -1 & -1 & 0 & 0 & 1 & 1 \\ 0 & 1 & -1 & 1 & -1 & 0 \\ 1 & 0 & 1 & -1 & 0 & -1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} -\tfrac{2}{3} & \tfrac{2}{3} \\ \tfrac{2}{3} & -\tfrac{2}{3} \\ 0 & 0 \end{bmatrix}.$$
For completeness, we may also extend the notion of compatibility to tied rankings
defined as follows.

Definition 7.2. A tied ordering of t objects is a partition into e sets, 1 ≤ e ≤ t, each
containing d_i objects, d_1 + d_2 + . . . + d_e = t, so that the d_i objects in each set share
the same rank i, 1 ≤ i ≤ e. Such a pattern is denoted δ = (d_1, d_2, . . . , d_e). The ranking
denoted by μ_δ = (μ_δ(1), . . . , μ_δ(t)) resulting from such an ordering is called a tied
ranking and is one of t!/(d_1! d_2! · · · d_e!) possible permutations.
For example, if t = 3 objects are ranked, it may happen that objects 1 and 2 are
equally preferred to object 3. Consequently, the rankings (1, 2, 3) and (2, 1, 3) would
both be plausible and should be placed in a “compatibility” class. The average of the
rankings in the compatibility class which results from the use of the Spearman distance
would then yield the ranking

$$\frac{1}{2}\left[(1, 2, 3) + (2, 1, 3)\right] = (1.5, 1.5, 3).$$
It is seen that this notion of compatibility for ties justifies the use of the mid-rank when
ties are present. Associated with every tied ranking, we may define a t! × t!/(d_1! d_2! · · · d_e!)
matrix of compatibility. Yu et al. (2002) considered the problem of testing for independence
between two random variables when the pattern for ties and for missing observations
are fixed.

7.5.2. General Block Designs


The parametric paradigm may be extended to the study of more general block designs.
We shall restrict attention to the Spearman distance. Consider the situation in which t
objects are ranked kh at a time, 2 ≤ kh ≤ t by b judges (blocks) independently and in
such a way that each object is presented to ri judges and each pair of objects (i, j) is
compared by λij judges, h = 1, . . . , b, i, j = 1, . . . , t. We would like to test the hypothesis
of no treatment effect, that is:
H0 : each judge, when presented with the specified kh objects, picks the ranking at
random from the space of kh ! permutations of (1, 2, . . . , kh ).


In the study of the asymptotic behavior of various statistics for such problems, we
consider n replications of such basic designs. For any incomplete kh -ranking μ∗ , define
the score vector of
μ∗ = (μ∗ (1) , μ∗ (2) , . . . , μ∗ (kh ))
as μ∗ ranges over each of the kh ! permutations (Alvo and Cabilio, 1991). From Alvo
and Cabilio (1991), for a given permutation of (1, 2, . . . , kh ), indexed by s = 1, 2, . . . , kh !,
define the vector xh(s) whose ith entry is given by
   
(t + 1) ∗ kh + 1 (t + 1) ∗ t+1
μh(s) (i) − δh (i) = μ (i) − δh (i), (7.13)
(kh + 1) 2 (kh + 1) h(s) 2

δh (i) is either 1 or 0 depending on whether the object i is, or is not, ranked in block h,
and μ∗h(s) (i), as defined above, is the rank of object i for the permutation indexed by s
for block pattern h. This is also the corresponding ith row element of column s of

$$T_h^* = \frac{k_h!}{t!}\, T\, C_h.$$

An (i, j) element of $\left(\frac{k_h!}{t!} T_S C_h\right)\left(\frac{k_h!}{t!} T_S C_h\right)'$ is thus of the form

$$\sum_{s=1}^{k_h!} \left(\frac{t+1}{k_h+1}\,\mu^*_{h(s)}(i) - \frac{t+1}{2}\right)\left(\frac{t+1}{k_h+1}\,\mu^*_{h(s)}(j) - \frac{t+1}{2}\right)\delta_h(i)\,\delta_h(j).$$

For a specific pattern of missing observations for each of the b blocks, the matrix of
scores is given by

$$T^* = (T_1^* \,|\, T_2^* \,|\, \ldots \,|\, T_b^*) = T\left(\frac{k_1!}{t!} C_1 \,\Big|\, \frac{k_2!}{t!} C_2 \,\Big|\, \ldots \,\Big|\, \frac{k_b!}{t!} C_b\right),$$

where the index of each block identifies the pattern of missing observations, if any, in
that block. It was shown in Alvo and Cabilio (1991) that the proposed test rejects H0
for large values of

$$G \equiv (T^* f)'(T^* f) = f'(T^*)'\, T^* f, \qquad (7.14)$$

where the $\sum_{h=1}^{b} k_h!$ dimensional vector f is the vector of frequencies for each of the b
patterns of the observed incomplete rankings. That is, f = (f_1' | f_2' | · · · | f_b')', where
f_h is the k_h! dimensional vector of the observed frequencies of each of the k_h! ranking
permutations for the incomplete pattern h = 1, 2, . . . , b. Using the fact that for these
distance measures the matrix T* is orthogonal to the vector of 1’s and proceeding in a
manner analogous to Alvo and Cabilio (1991), we have the following.


Theorem 7.7. Under H0 as n → ∞,

$$\frac{1}{\sqrt{n}}\, T^* f \;\xrightarrow{L}\; N(0, \Gamma), \qquad (7.15)$$

where N(0, Γ) is a multivariate normal with mean 0 and covariance matrix

$$\Gamma = \sum_{h=1}^{b} \frac{1}{k_h!}\left(\frac{k_h!}{t!}\, T C_h\right)\left(\frac{k_h!}{t!}\, T C_h\right)'. \qquad (7.16)$$

Thus

$$n^{-1} G = n^{-1} (T^* f)'(T^* f) \;\xrightarrow{L}\; \sum \alpha_i z_i^2, \qquad (7.17)$$

where {z_i} are independent identically distributed normal variates, and {α_i} are the
eigenvalues of Γ.
In the Spearman case, for a given permutation of (1, 2, . . . , k_h), indexed by s =
1, 2, . . . , k_h!, the corresponding ith row element of column s of $\frac{k_h!}{t!} T_S C_h$ is found to be

$$\frac{t+1}{k_h+1}\left(\mu^*_{h(s)}(i) - \frac{k_h+1}{2}\right)\delta_h(i) = \left(\frac{t+1}{k_h+1}\,\mu^*_{h(s)}(i) - \frac{t+1}{2}\right)\delta_h(i), \qquad (7.18)$$

where δ_h(i) is either 1 or 0 depending on whether the object i is, or is not, ranked in block h,
and μ*_{h(s)}(i), as defined above, is the rank of object i for the permutation indexed by s
for block pattern h.
Lemma 7.1. In the Spearman case we have that the t × t matrix

$$\Gamma = \sum_{h=1}^{b} \frac{1}{k_h!}\left(\frac{k_h!}{t!}\, T_S C_h\right)\left(\frac{k_h!}{t!}\, T_S C_h\right)' = \sum_{h=1}^{b} \gamma_h^2\, A_h$$

where

$$A_h = \begin{cases} \dfrac{k_h - 1}{k_h}\, \delta_h(j) & \text{on the diagonal,} \\[2mm] -\dfrac{1}{k_h}\, \delta_h(j)\, \delta_h(j') & \text{off the diagonal,} \end{cases}$$

and $\gamma_h^2 = \dfrac{1}{k_h - 1} \displaystyle\sum_{j=1}^{k_h} \left(\dfrac{t+1}{k_h+1}\, j - \dfrac{t+1}{2}\right)^2$. The elements of Γ are thus

$$\begin{cases} \dfrac{(t+1)^2}{12} \displaystyle\sum_{h=1}^{b} \dfrac{k_h - 1}{k_h + 1}\, \delta_h(j) & \text{on the diagonal,} \\[2mm] -\dfrac{(t+1)^2}{12} \displaystyle\sum_{h=1}^{b} \dfrac{1}{k_h + 1}\, \delta_h(j)\, \delta_h(j') & \text{off the diagonal.} \end{cases} \qquad (7.19)$$

Note that the elements of each row of Γ sum to 0, so that rank (Γ) ≤ t − 1.


The matrix Γ with elements given in Lemma 7.1 is closely related to the information
matrix of a block design. John and Williams (1995) details how this matrix occurs in
the least squares estimation of treatment effects and the role its eigenvalues play in
determining optimality criteria for choosing between different designs. This information
matrix A has components as follows:
$$A = \begin{cases} \displaystyle\sum_{h=1}^{b} \frac{k_h - 1}{k_h}\, \delta_h(i) & \text{on the diagonal,} \\[2mm] -\displaystyle\sum_{h=1}^{b} \frac{1}{k_h}\, \delta_h(i)\, \delta_h(j) & \text{off the diagonal.} \end{cases} \qquad (7.20)$$

Note that A and Γ share the same rank.


In view of the independence of the observations in each block, we may define a
smooth alternative for each block:

$$\pi\left(x_{h(s)}; \theta_h\right) = \exp\left\{\theta_h'\, x_{h(s)} - K(\theta_h)\right\} / (k_h!), \quad s = 1, \ldots, k_h!, \; h = 1, \ldots, b,$$

where θ_h represents the k_h × 1 vector of parameters associated with the k_h objects
ranked in block h. We interpret the vectors x_{h(s)} as column vectors with values taken by
a random vector X^{(h)} with probability mass function

$$P\left(X^{(h)} = x_{h(s)}\right) = \pi\left(x_{h(s)}; \theta_h\right), \quad s = 1, \ldots, k_h!.$$

The smooth alternative for the entire block design will be the product of the models for
each block. We are interested in testing

H0 : θ_h = 0, for all h,
H1 : θ_h ≠ 0, for some h.

The log likelihood, derived from the joint multinomial probability function, is given by


$$l(\theta) \sim \sum_{h=1}^{b} \sum_{s=1}^{k_h!} n_{h(s)} \log \pi\left(x_{h(s)}; \theta_h\right) = \sum_{h=1}^{b} \theta_h' \sum_{s=1}^{k_h!} n_{h(s)}\, x_{h(s)} - \sum_{h=1}^{b} \sum_{s=1}^{k_h!} n_{h(s)}\, K(\theta_h)$$

where nh(s) represents the frequency of occurrence of the value xh(s) . Under this formu-
lation, using a similar argument as in Section 7.2, we can show that the score test leads
to the result in Theorem 7.7. Two examples of block designs are considered below.

Example 7.5. Balanced Incomplete Block Designs (BIBD)


In the complete ranking case kh = t, b = n, so that the design becomes a randomized


block. An example of a test in such a situation is the Friedman Test with test statistic

$$G_S = \sum_{i=1}^{t} \left(R_i - \frac{n(t+1)}{2}\right)^2,$$

where R_i is the sum of the ranks assigned by the judges to object i. Under H0, as
n → ∞,

$$\frac{G_S}{n\, t(t+1)/12} \;\xrightarrow{L}\; \chi^2_{t-1}.$$
An interpretation of the Friedman statistic is that it is essentially the average of the
Spearman correlations between all pairs of rankings. In the balanced incomplete block
design, we have k_h = k, r_i = r, λ_{ij} = λ, bk = rt, and λ(t − 1) = r(k − 1). An example
of a test in such a case is due to Durbin whose test statistic is

$$G_S = \sum_{i=1}^{t} \left(\frac{t+1}{k+1}\, R_i - \frac{nr(t+1)}{2}\right)^2.$$

Under H0, as n → ∞,

$$\frac{G_S}{n\, \lambda t (t+1)^2 / (12(k+1))} \;\xrightarrow{L}\; \chi^2_{t-1}.$$

Example 7.6. Group Divisible and Cyclic Design


It may not always be possible to construct balanced incomplete block designs. For
example, the smallest number of replications for a BIBD is $r = \frac{\lambda(t-1)}{k-1}$. When t = 6, k = 4,
the balanced design would require r = 10 replications and hence rt = 60 experimental
the balanced design would require r = 10 replications and hence rt = 60 experimental
units. Such a large number of units may either not be available or too costly for the
researcher to acquire. The partially balanced group divisible design helps to reduce the
number of experimental units required at the cost of forcing some comparisons between
treatments to be more precise than others. In this design, the t objects occur in g groups
of d objects, t = gd. Within each group pairs of objects are compared by λ1 judges,
whereas each pair of objects between groups is compared by λ2 judges. Such designs
must satisfy the additional conditions

bk = rt, r (k − 1) = (d − 1) λ1 + d (g − 1) λ2 .

An example of a group divisible design is given in Table 7.9. Here, t = 6, b = 3, k =
4, r = 2, g = 3, d = 2, λ1 = 2, λ2 = 1. Treatments (1, 4), (2, 5), (3, 6) are compared
λ1 = 2 times whereas all the other pairs are compared only once. The number of
experimental units required here is rt = 12 compared to the 60 for a BIBD.
Incomplete block designs often require the use of tables which may not always be
available in the field. Moreover, care must be taken to record data correctly for such
designs. Cyclic designs on the other hand are easily constructed from an initial block and
the treatments can be easily assigned. Such designs are obtained by cyclic development of


Table 7.9.: Example of a group divisible design


Treatments 1 2 3 4 5 6
Block 1 X X X X
Block 2 X X X X
Block 3 X X X X

Table 7.10.: Example of a cyclic design


Treatments 1 2 3 4 5 6
Block 1 X X X
Block 2 X X X
Block 3 X X X
Block 4 X X X
Block 5 X X X
Block 6 X X X

an initial block or combinations of such sets. Let λ_0 = r, and let λ_{j-1} = λ_{1j}, j = 2, . . . , t, be
the number of judges that compare object 1 with object j. The matrix A is a circulant
related to the matrix derived by the cyclic development of (λ_0, λ_1, . . . , λ_{t-1})', and its
eigenvalues are given in John and Williams (1995). The eigenvalues of Γ are

$$\alpha_i = \frac{(t+1)^2}{12}\left[\frac{r(k-1)}{k+1} - \frac{1}{k+1} \sum_{h=1}^{t-1} \lambda_h \cos\left(\frac{2\pi i h}{t}\right)\right], \quad i = 1, 2, \ldots, t-1.$$

An example of a cyclic design with t = 6, k = r = 3, b = 6 is given in Table 7.10. Note
that this is not a BIBD since, for example, λ14 = 2 ≠ λ13 = 1.
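
A sketch (ours, not from the text) evaluating this eigenvalue formula for the cyclic design of Table 7.10, assuming the design is developed cyclically from the initial block {1, 2, 4}:

import numpy as np

t, k, r = 6, 3, 3
blocks = [tuple(sorted(((s + o) % t) + 1 for o in (0, 1, 3))) for s in range(t)]

# lambda_h = number of blocks containing both object 1 and object 1 + h
lam = [sum(1 in b and (1 + h) in b for b in blocks) for h in range(1, t)]

alpha = [(t + 1) ** 2 / 12 *
         (r * (k - 1) / (k + 1)
          - sum(lam[h - 1] * np.cos(2 * np.pi * i * h / t)
                for h in range(1, t)) / (k + 1))
         for i in range(1, t)]
print(lam)                 # [1, 1, 2, 1, 1]: lambda_14 = 2 while lambda_13 = 1
print(np.round(alpha, 3))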

7.6. Exercises
Exercise 7.1. 1. Calculate E[W] under the null hypothesis.

Hint: First show that $W = \frac{12}{nt(t+1)} \sum_{i=1}^{t}\left(\sum_{j=1}^{n} R_{ij}\right)^2 - 3n(t+1)$ and then use the
properties of the ranks from Lemma 3.1 under the null hypothesis.

Exercise 7.2. Obtain, for t = 4, an interpretation for the components of T_S p.

Exercise 7.3. Some evidence suggests that anxiety and fear can be differentiated from
anger according to feelings along a dominance-submissiveness continuum. In order to
determine the reliability of the ratings on a sample of animals, the ranks of the ratings
given by two observers, Ryan and Jacob, were tabulated below. Perform a suitable test
at the 5% significance level whether Ryan and Jacob agree on their ratings.


Animal 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Ryan’s Ranks 4 7 13 3 15 12 5 14 8 6 16 2 10 1 9 11
Jacob’s Ranks 2 6 14 7 13 15 3 11 8 5 16 1 12 4 9 10

Exercise 7.4. It is widely believed that the Earth is getting warmer year by year,
caused by the increasing amount of greenhouse gases emitted by humans, which are trap-
ping more heat in the Earth’s atmosphere. However, some people cast doubt on that,
claiming that the average global temperature actually remains stable over time whereas
extreme weather occurs more and more frequently. In order to test these claims, the
data on the annual mean temperature (MEAN, in °C), the annual maximum (MAX)
and minimum (MIN) temperatures (in °C) in Hong Kong for the years (YEAR) 1961
to 2016 are extracted from the Hong Kong Observatory website; see Appendix A.8.

(a) Test whether the annual mean temperature M EAN shows an increasing trend at
the 5% significance level using (i) Spearman test and (ii) Kendall test.

(b) Test whether the annual temperature range (defined as RANGE ≡ MAX − MIN)
shows a monotone trend at the 5% significance level using (i) Spearman test and
(ii) Kendall test.

(c) Based on the results found in (a) to (b), draw your conclusions and discuss the
limitation of your analysis.

Exercise 7.5. Eight subjects were given one, two, and three alcoholic beverages at
different widely spaced times. After each drink, the closest distance in feet they would
approach an alligator was measured. Test the hypothesis that there is no difference
between the different amounts of drinks consumed.
1 2 3 4 5 6 7 8
One drink 19.0 14.4 18.2 15.6 14.6 11.2 13.9 11.6
Two drinks 6.3 11.6 9.7 5.9 13.0 9.8 4.8 10.7
Three drinks 3.3 1.2 3.7 7.1 2.6 1.9 5.2 6.4

Exercise 7.6. There are 6 psychiatrists who examine 10 patients for depression. Each
patient is seen by 3 psychiatrists who provide a score as shown below. Analyze the data
using a nonparametric balanced incomplete block design. Are there differences among
the psychiatrists?


Patient 1 2 3 4 5 6
1 10 14 10
2 3 2 1
3 7 12 9
4 3 8 5
5 20 26 20
6 20 14 20
7 5 8 14
8 14 18 15
9 12 17 12
10 18 19 13

8. Optimal Rank Tests
Lehmann and Stein (1949) and Hoeffding (1951b) pioneered the development of an op-
timal theory for nonparametric tests, parallel to that of Neyman and Pearson (1933)
and Wald (1949) for parametric testing. They considered nonparametric hypotheses
that are invariant under permutations of the variables in multi-sample problems1 so
that rank statistics are the maximal invariants, and extended the Neyman-Pearson and
Wald theories for independent observations to the joint density function of the maximal
invariants. Terry (1952) and others subsequently implemented and refined Hoeffding’s
approach to show that a number of rank tests are locally most powerful at certain alter-
natives near the null hypothesis. We shall first consider Hoeffding’s change of measure
formula and derive some consequences with respect to the two-sample problem. This
formula assumes knowledge of the underlying distribution of the random variables and
leads to an optimal choice of score functions and subsequently to locally most powerful
tests. Hence, for any given underlying distributions, we may obtain the optimal test
statistic.
In previous chapters, we did not assume knowledge of the underlying distributions of
the random variables. Instead, we derived test statistics through a parametric embedding
based on either a kernel or score function. The connection between the two approaches is
now evident. Using the results of the present chapter, it will then be possible to calculate
the efficiency of previously derived test statistics with respect to the optimal for specific
underlying distribution of the variables. We postpone the discussion of efficiency to
Chapter 9 of this book.
In Section 8.1 we see that the locally most powerful tests can be included in the
unified theory discussed in Section 6.1. In Section 8.2 we discuss the regression problem,
whereas in Section 8.3 we present the optimal choice of score function in the two-way
layout.

¹Lehmann & Stein considered the case for two samples and Hoeffding for multi-samples.


8.1. Locally Most Powerful Rank-Based Tests


Rank-based tests are invariant to monotone transformations and consequently do not
depend on the scale of measurement. This is not always the case for parametric tests; for
example, the t-test is altered if x is replaced by log x. In this section, we shall consider a change of
measure formula (Hoeffding, 1951b), known in the literature as Hoeffding’s Formula, in
connection with the distribution of the ranks of the data. In this approach, we are able
to incorporate a priori knowledge of the underlying distributions of the random variables
to determine the locally most powerful test. Let Z1 , . . . , Zn be a sequence of independent
random variables with densities f1 , . . . , fn respectively. Suppose that the null hypothesis
postulates that the variables are identically distributed. Let R = (R1 , . . . , Rn ) denote
the vector of ranks where Ri is the rank of Zi among Z1 , . . . , Zn . The following lemma
is due to Hoeffding.

Lemma 8.1 (Hoeffding’s Formula). Assume that V_1 < . . . < V_n are the order statistics
of a sample of size n from a density h. Then

$$
\begin{aligned}
P(R = r) &= \int \cdots \int_{(v_1 < \ldots < v_n)} f_1(v_{r_1}) \cdots f_n(v_{r_n})\, dv_1 \cdots dv_n & (8.1) \\
&= \int \cdots \int_{(v_1 < \ldots < v_n)} \frac{f_1(v_{r_1}) \cdots f_n(v_{r_n})}{n!\, h(v_1) \cdots h(v_n)}\; n!\, h(v_1) \cdots h(v_n)\, dv_1 \cdots dv_n & (8.2) \\
&= \frac{1}{n!}\, E_h\left[\frac{f_1\left(V_{(r_1)}\right) \cdots f_n\left(V_{(r_n)}\right)}{h\left(V_{(r_1)}\right) \cdots h\left(V_{(r_n)}\right)}\right]. & (8.3)
\end{aligned}
$$

To illustrate the use of Hoeffding’s lemma, we consider the two-sample problem where
X1 , . . . , Xm constitute a random sample from a cdf F having density f and Y1 , . . . , Yn−m
are a random sample from a cdf G having density g. As well, assume that g = 0 implies
f = 0. Substituting in Hoeffding’s formula (8.3)

$$f_i = \begin{cases} f & i = 1, \ldots, m \\ g & i = m+1, \ldots, n \end{cases}$$

and h = g, we have

$$P(R = r) = \frac{1}{n!}\, E_g\left[\frac{f\left(V_{(r_1)}\right) \cdots f\left(V_{(r_m)}\right)}{g\left(V_{(r_1)}\right) \cdots g\left(V_{(r_m)}\right)}\right].$$


It follows that if R_1, . . . , R_m denote the ranks of the first population,

$$
\begin{aligned}
P\{R_1 = r_1, \ldots, R_m = r_m\} &= \sum P(R_1 = r_1, \ldots, R_m = r_m, R_{m+1}, \ldots, R_n) \\
&= \sum \frac{1}{n!}\, E_g\left[\frac{f\left(V_{(r_1)}\right) \cdots f\left(V_{(r_m)}\right)}{g\left(V_{(r_1)}\right) \cdots g\left(V_{(r_m)}\right)}\right] \\
&= \frac{m!\,(n-m)!}{n!}\, E_g\left[\frac{f\left(V_{(r_1)}\right) \cdots f\left(V_{(r_m)}\right)}{g\left(V_{(r_1)}\right) \cdots g\left(V_{(r_m)}\right)}\right] \\
&= \frac{1}{\binom{n}{m}}\, E_g\left[\frac{f\left(V_{(r_1)}\right) \cdots f\left(V_{(r_m)}\right)}{g\left(V_{(r_1)}\right) \cdots g\left(V_{(r_m)}\right)}\right], \qquad (8.4)
\end{aligned}
$$

where V_{(1)} < . . . < V_{(n)} are the order statistics from a random sample of n observations
with common density g and the summation is over all possible permutations of the
first and second samples among themselves. The expectation E_g is with respect to the
probability measure under which the n observations are i.i.d. with common density
function g, assuming that g is positive whenever f is.
Corresponding to a density function f with cdf F, Hájek and Sidak (1967) considered
a general score function for location problems defined as

$$a_n(i, f) = E\left[\varphi\left(V_n^{(i)}, f\right)\right], \quad 1 \leq i \leq n \qquad (8.5)$$

$$= n \binom{n-1}{i-1} \int_0^1 \varphi(v, f)\, v^{i-1} (1-v)^{n-i}\, dv \qquad (8.6)$$

where $V_n^{(1)} < \ldots < V_n^{(n)}$ are the order statistics from a uniform distribution on (0, 1)
and

$$\varphi(v, f) = -\frac{f'\left(F^{-1}(v)\right)}{f\left(F^{-1}(v)\right)}, \quad 0 < v < 1.$$

Theorem 8.1 (Location Alternatives). Assume $\int |f'(x)|\, dx < \infty$. Suppose that we have
a sample of size m from the first population and a sample of size n from the second. The
test with critical region

$$\sum_{i=1}^{m} a_n(R_i, f) \geq k$$

is the locally most powerful test for testing

$$H_0: f(x_1, \ldots, x_{m+n}) = \prod_{i=1}^{m+n} f_X(x_i) \qquad (8.7)$$

against

$$H_1: f(x_1, \ldots, x_{m+n}) = \prod_{i=1}^{m} f_X(x_i - \Delta) \prod_{i=m+1}^{m+n} f_X(x_i), \quad \Delta > 0.$$
Proof. See page 67 of Hájek and Sidak (1967).


We shall consider two examples to illustrate this theorem which results from an
application of Hoeffding’s formula. These include the Wilcoxon test when g(x) = ex /(1+
ex )2 is the logistic density and the Fisher-Yates test when g is the standard normal
density.
Example 8.1 (Wilcoxon Test). Suppose that we have a random sample of size m from
the logistic distribution whose density is given by

$$f(x) = \frac{e^{-x}}{(1 + e^{-x})^2}, \quad -\infty < x < \infty.$$

Then, $F(x) = \frac{1}{1 + e^{-x}}$ and

$$\log f(x) = -x - 2 \log\left(1 + e^{-x}\right)$$

from which

$$\frac{\partial}{\partial x} \log f(x) = \frac{f'(x)}{f(x)} = -1 + 2\left[1 - F(x)\right].$$

Hence,

$$\frac{f'\left(F^{-1}(V)\right)}{f\left(F^{-1}(V)\right)} = -1 + 2\left[1 - V\right].$$
Recall from Lemma 2.1 that the ith order statistic from a uniform sample of size m has density

$$f_{V_{(i)}}(v) = \frac{m!}{(i-1)!\,(m-i)!}\, v^{i-1} (1-v)^{m-i}.$$

Consequently,

$$
\begin{aligned}
E\left[1 - V_{(i)}\right] &= \frac{m!}{(i-1)!\,(m-i)!} \int_0^1 v^{i-1} (1-v)^{m+1-i}\, dv \\
&= \frac{m!\,(i-1)!\,(m+1-i)!}{(i-1)!\,(m-i)!\,(m+1)!} \\
&= \frac{m+1-i}{m+1}.
\end{aligned}
$$
It follows that the locally most powerful test when the underlying density is logistic is
based on

$$-\sum_{i=1}^{m} E\left[\frac{f'\left(F^{-1}(V_{(R_i)})\right)}{f\left(F^{-1}(V_{(R_i)})\right)}\right] = -\sum_{i=1}^{m} E\left[-1 + 2\left(1 - V_{(R_i)}\right)\right] = 2 \sum_{i=1}^{m} \frac{R_i}{m+1} - m$$
which is recognized as the Wilcoxon statistic.
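
The order-statistic expectation used above is easily checked by a quick Monte Carlo sketch (ours, not from the text):

import numpy as np

n, B = 8, 200_000
rng = np.random.default_rng(0)
V = np.sort(rng.uniform(size=(B, n)), axis=1)
print(np.round(1 - V.mean(axis=0), 3))            # Monte Carlo E[1 - V_(i)]
print((n + 1 - np.arange(1, n + 1)) / (n + 1))    # exact (n+1-i)/(n+1)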


Example 8.2 (Fisher-Yates Test). Suppose that we have a random sample of size m
from the normal distribution with mean μ and variance σ² > 0 given by

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}, \quad -\infty < x, \mu < \infty.$$

Then,

$$\frac{\partial}{\partial x} \log f(x) = -\frac{1}{\sigma}\left(\frac{x - \mu}{\sigma}\right).$$

Once again, using the density of the ith order statistic, we see that

$$
\begin{aligned}
E\left[\frac{X_{(i)} - \mu}{\sigma}\right] &= \frac{m!}{(i-1)!\,(m-i)!} \int z\, [\Phi(z)]^{i-1}\, [1 - \Phi(z)]^{m-i}\, \phi(z)\, dz \\
&= \frac{m!}{(i-1)!\,(m-i)!} \int_0^1 \Phi^{-1}(u)\, u^{i-1}\, (1-u)^{m-i}\, du \\
&= E\left[\Phi^{-1}\left(V_{(i)}\right)\right],
\end{aligned}
$$

where V_{(i)} is the ith order statistic from a random sample of m uniform random variables
on the interval (0, 1). There is no closed form for this last expectation, though there are
various approximations.² In fact, using the delta method first order approximation,

$$E\left[\Phi^{-1}\left(V_{(i)}\right)\right] \approx \Phi^{-1}\left(E\left[V_{(i)}\right]\right) = \Phi^{-1}\left(\frac{i}{m+1}\right).$$

This then yields the normal scores or Fisher-Yates test statistic

$$\sum_{i=1}^{m} E\left[\Phi^{-1}\left(V_{(R_i)}\right)\right]$$

which may be approximated by

$$\sum_{i=1}^{m} \Phi^{-1}\left(\frac{R_i}{m+1}\right).$$

The latter is called the van der Waerden test.

²See also Royston (1982) who provides the approximation
$$E\left[X_{(i)}\right] = \mu + \sigma\, \Phi^{-1}\left(\frac{i}{m+1}\right)\left[1 + \frac{\frac{i}{m+1}\left(1 - \frac{i}{m+1}\right)}{2(m+2)\left[\phi\left(\Phi^{-1}\left(\frac{i}{m+1}\right)\right)\right]^2}\right].$$


A parametric embedding argument can be used to give an alternative derivation of
the local optimality of the Fisher-Yates and Wilcoxon tests. In the two-sample problem,
define

$$\pi\left(x_{1j}, x_{2j}; \theta_1, \theta_2\right) = \exp\left\{\sum_{\ell=1}^{2}\left[\theta_\ell'\, x_{\ell j} - K(\theta_\ell)\right]\right\} p_{0j}, \quad j = 1, \ldots, n!, \qquad (8.8)$$

where θ_ℓ = (θ_{ℓ1}, . . . , θ_{ℓk})' represents the parameter vector for sample ℓ (= 1, 2) and x_{1j},
x_{2j} are the data from sample 1 and sample 2 with respective sizes m and n − m that are
associated with the ranking (permutation) ν_j, j = 1, . . . , n!.
Under the null hypothesis H0 : θ1 = θ 2 , we can assume without loss of generality
that the underlying V1 , . . . , Vn from the combined sample are i.i.d. uniform (by consid-
ering G(Vi ), where G is the common distribution function, assumed to be continuous,
of the Vi) and that all rankings of the Vi are equally likely. Hence (8.8) represents
an exponential family constructed by exponential tilting of the baseline measure (i.e.,
corresponding to H0 ) on the rank-order data. This has the same spirit as Neyman’s
smooth test of the null hypothesis that the data are i.i.d. uniform against alternatives
in the exponential family (4.1). The Neyman-Pearson lemma can be applied to show
that the score tests have maximum local power at the alternatives that are near θ = 0.
The Neyman parametric embedding in (4.1) makes these results directly applicable to
the rank-order statistics. In particular, this shows that the two-sample Wilcoxon test
of H0 is locally most powerful for testing the uniform distribution against the truncated
exponential distribution for which the data are constrained to lie in the range (0, 1) of
the uniform distribution. Note that these exponential tilting alternatives differ from the
location alternatives in the preceding paragraph not only in their distributional form
(truncated exponential instead of logistic) but also in avoiding the strong assumption of
the preceding paragraph that the data have to be generated from the logistic distribution
even under the null hypothesis (8.7).
Similar results are also valid for tests against scale alternatives. Define the scores

$$a_{1n}(i, f) = \varphi_1\left(E V_n^{(i)}, f\right) \approx \varphi_1\left(\frac{i}{n+1}, f\right),$$

where

$$\varphi_1(v, f) = -1 - F^{-1}(v)\, \frac{f'\left(F^{-1}(v)\right)}{f\left(F^{-1}(v)\right)}, \quad 0 < v < 1.$$
Theorem 8.2 (Scale Alternatives). Assume $\int_{-\infty}^{\infty} |x f'(x)|\, dx < \infty$. The test with critical
region

$$\sum_{i=1}^{n} c_i\, a_{1n}(R_i, f) \geq k$$


is the locally most powerful test for testing

$$H_0: f(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_X(x_i)$$

against

$$H_1: f(x_1, \ldots, x_n) = \exp\left(-\Delta \sum_{i=1}^{n} c_i\right) \prod_{i=1}^{n} f_X\left((x_i - \mu)\, e^{-\Delta c_i}\right), \quad \Delta > 0,$$

where the {ci } are regression constants.

Proof. See page 69 of Hájek and Sidak (1967).

Example 8.3 (Locally Most Powerful Test for Scale). Suppose that F has an absolutely
continuous density f for which $\int |x f'(x)|\, dx < \infty$. We would like to test

$$H_0: F(x) = G(x) \quad versus \quad H_1: G(x) = F(x/\theta), \; \theta > 1.$$

Let

$$S_m = \sum_{j=1}^{m} E_F\left[-1 - V_{(q_j)}\, \frac{f'\left(V_{(q_j)}\right)}{f\left(V_{(q_j)}\right)}\right],$$

where V_{(1)} < . . . < V_{(N)} are the order statistics of a random sample of N from F. Then
the locally most powerful rank test is given by

$$\phi(q) = \begin{cases} 1 & S_m > k_\alpha \\ \gamma & S_m = k_\alpha \\ 0 & S_m < k_\alpha. \end{cases}$$

In the special case where F is a normal distribution with mean 0 and variance σ², then
the locally most powerful rank test is based on the sum

$$S_m = \sum_{j=1}^{m} E\left[Z^2_{(q_j)}\right],$$

where Z_{(1)} < . . . < Z_{(N)} are the order statistics of a random sample of N from a standard
normal distribution. On the other hand, if f(x) = e^{-x}, x > 0, then the locally most
powerful rank test is based on the sum of the Savage scores given by

$$S_m = \sum_{j=1}^{m} \left[\log\left(1 - \frac{q_j}{N+1}\right)^{-1} - 1\right].$$


One may well ask whether the unified theory described in Chapter 6 produces locally
most powerful tests. Indeed we recall the locally most powerful tests are a subset of the
class of general linear rank statistics. Consider the two-sample problem for a location
alternative. The locally most powerful test rejects for small values of

$$\sum_{i=1}^{n_1} E_g\left[-\frac{g'\left(V_{(\mu(i))}\right)}{g\left(V_{(\mu(i))}\right)}\right],$$

where V_{(1)} < . . . < V_{(n)} are the order statistics of a random sample of size n from g. Let

$$h(j) = E_g\left[-\frac{g'\left(V_{(j)}\right)}{g\left(V_{(j)}\right)}\right]$$

and consider the Euclidean distance function

$$d(\mu, \nu) = \sum_{i=1}^{n} \left[h(\mu(i)) - h(\nu(i))\right]^2.$$

Expanding this sum, we see that the test statistic induced by this distance function can
be shown to be

$$\sum_{i=1}^{n_1} h(\mu(i)) = \sum_{i=1}^{n_1} E_g\left[-\frac{g'\left(V_{(\mu(i))}\right)}{g\left(V_{(\mu(i))}\right)}\right].$$

The demonstration is identical to the one using the permutations directly.

8.2. Regression Problems


We may also obtain locally most powerful tests for regression problems. Consider the
general regression model

$$Y = X\beta + \epsilon, \qquad (8.9)$$

where Y is an n-dimensional vector response variable, X is an n × p design matrix, β is a
p-dimensional vector of regression parameters, and ε is a vector random error term with
independent identically distributed components having continuous distribution F(·) and
density f. Let Y_1, . . . , Y_n be a sample of n observed responses and let

$$x_i = (x_{i1}, \ldots, x_{ip})'$$

be the corresponding covariates. Denote the order statistics Y_{(1)} < . . . < Y_{(n)} and let
their ranks be denoted by R = (R_1, . . . , R_n)'. The probability distribution of R is given
by

$$f(r; \beta) = \int \prod_{i=1}^{n} f\left(u_i - x_i'\beta\right) du_i \qquad (8.10)$$


where integration is over the set {(u1 , . . . , un ) |u1 < . . . < un } . Kalbfleisch (1978) arrived
at (8.10) through group considerations. Specifically, it was argued that the regression
model (8.9) conditional on x is invariant under the group of increasing differentiable
transformations acting on the response. In the parameter space, this leaves β invariant.
We may obtain the locally most powerful test of the hypothesis H0 : β = 0 by computing
the score function U at β = 0:

$$U = \left.\frac{\partial \log f(r; \beta)}{\partial \beta}\right|_{\beta = 0}. \qquad (8.11)$$

The lth component for l = 1, . . . , p is

$$
\begin{aligned}
U_l &= \frac{1}{f(r; 0)}\left.\frac{\partial f(r; \beta)}{\partial \beta_l}\right|_{\beta = 0} \\
&= n! \int \prod_{i=1}^{n} f(u_i)\left(-\sum_{j=1}^{n} x_{(j)l}\, a(j)\right) du_i \\
&= -\sum_{j=1}^{n} x_{(j)l}\, a(j) = -\sum_{j=1}^{n} x_{jl}\, a(R_j), \qquad (8.12)
\end{aligned}
$$

where

$$a(i) = E\left[\frac{\partial \log f\left(Z_{(i)}\right)}{\partial Z_{(i)}}\right] \qquad (8.13)$$

and Z_{(i)} is the ith order statistic of a sample of size n from f. Hence, when p > 1 we
obtain a vector of linear rank statistics.
obtain a vector of linear rank statistics.
If p = 1, then

$$\sum_{j=1}^{n} x_j\, a(R_j)$$

is the usual simple linear rank statistic. There is a close connection in that case with
the usual Pearson correlation coefficient as we saw in Section 3.1 which we recall in the
next example.

Example 8.4. Consider the simple linear regression model

Yi = α + Δxi + εi , i = 1, . . . , n


where {ε_i} are independent identically distributed random variables from a cumulative
distribution function having median 0. We wish to test

$$H_0: \Delta = \Delta_0$$

against

$$H_1: \Delta \neq \Delta_0.$$

Assume x_1 < . . . < x_n and let R_i denote the rank of the residual Y_i − Δ_0 x_i among
{Y_j − Δ_0 x_j}. The usual Pearson correlation coefficient is given by

$$\rho_n = \frac{\sum_{i=1}^{n} (x_i - \bar{x}_n)\left(a(R_i) - \bar{a}_n\right)}{\left[\sum_{i=1}^{n} (x_i - \bar{x}_n)^2\, \sum_{i=1}^{n} \left(a(R_i) - \bar{a}_n\right)^2\right]^{1/2}},$$

where the score function is increasing with a(1) < a(n) and

$$\bar{a}_n = \frac{1}{n} \sum_{i=1}^{n} a(R_i).$$

It is seen that ρ_n is linearly equivalent to the linear rank statistic

$$\sum_{i=1}^{n} x_i\, a(R_i).$$

We may determine the optimal regression score statistic corresponding to specific
error distributions using (8.13), as in the following examples.

(i) Suppose f is the standard normal distribution. Then, $a(i) = -E\left[Z_{(i)}\right]$, which
leads to the Fisher-Yates normal scores test.

(ii) If f is the logistic density

$$f(z) = \frac{e^{-z}}{(1 + e^{-z})^2},$$

then

$$a(i) = E\left[\frac{-1 + e^{-Z_{(i)}}}{1 + e^{-Z_{(i)}}}\right] = 1 - 2\, E\left[\frac{1}{1 + e^{-Z_{(i)}}}\right].$$

Since $Y = \frac{1}{1 + e^{-Z}}$ has a uniform distribution on the interval (0, 1), it follows that

$$a(i) = 1 - 2\, E\left[Y_{(i)}\right] = 1 - \frac{2i}{n+1}$$

which leads to the Wilcoxon statistic.


(iii) If f is the extreme value density

$$f(z) = e^{z - e^z}$$

then

$$a(i) = E\left[1 - e^{Z_{(i)}}\right].$$
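
As a quick illustration of these score choices, here is a minimal sketch (ours, not from the text) of the vector of linear rank statistics (8.12) with the logistic (Wilcoxon-type) scores of case (ii); the helper name and data are ours.

import numpy as np
from scipy.stats import rankdata

def regression_score(X, y):
    n = len(y)
    a = 1 - 2 * rankdata(y) / (n + 1)        # a(R_j) for each observation
    Xc = X - X.mean(axis=0)                  # centre columns so sum_i x_il = 0
    return -Xc.T @ a                         # U_l = -sum_j x_jl a(R_j)

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))
y = X @ np.array([1.0, 0.0]) + rng.normal(size=30)
print(regression_score(X, y))                # large first component, small second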

We may obtain a general form for the score test in the regression model. The score
function U evaluated at β = 0 has variance-covariance matrix

$$I_0 = E\{U U'\}, \quad \text{with } (l, l') \text{ element } \sum_{i}\sum_{i'} x_{il}\, x_{i'l'}\, E\left(a(R_i)\, a(R_{i'})\right).$$

Suppose that $\sum_{i=1}^{n} x_{il} = 0$ for each l, so that $\sum_{i}\sum_{i'} x_{il}\, x_{i'l'} = 0$. Then,

$$
\begin{aligned}
\sum_{i}\sum_{i'} x_{il}\, x_{i'l'}\, E\left(a(R_i)\, a(R_{i'})\right) &= \sum_{i} x_{il}\, x_{il'}\, E a^2(R_i) + \sum_{i \neq i'} x_{il}\, x_{i'l'}\, E\left(a(R_i)\, a(R_{i'})\right) \\
&= \sum_{i} x_{il}\, x_{il'}\, E a^2(R_i) - \sum_{i} x_{il}\, x_{il'}\, E\left(a(R_i)\, a(R_{i'})\right) \\
&= \sum_{i} x_{il}\, x_{il'}\left[E a^2(R_i) - E\left(a(R_i)\, a(R_{i'})\right)\right].
\end{aligned}
$$

Now note

1 2 
n
1
Ea (Ri ) − E (a (Ri ) a (Ri )) =
2
a (i) − a (i) a (i )
n i=1 n (n − 1) i
=i
  2  
1 2
n
1
= a (i) − a (i) − 2
a (i)
n i=1 n (n − 1)
1  n
= (a (i) − ā)2
(n − 1) i=1
n
a(i)
where ā = i=1n .
Now set the design matrix X = (xij ) . It follows that the ll term of X  X is given by
 

n
xil xil ,
i=1

173
8. Optimal Rank Tests

and hence the variance-covariance matrix I0 becomes


  n 
1 
I0 = (X”X) (a (i) − ā)2 .
n − 1 i=1

Assuming that ā = 0, we have


  
1 
n
I0 = (X”X) a2 (i)
n−1 i=1

and hence using (8.12) and (8.13), the score test takes the form

U  I0−1 U.

8.3. Optimal Tests for the Method of n-Rankings


Sen (1968a) considered the problem of finding efficient tests by the method of n-rankings.
Specifically, he considered the two-way layout in situations where block effects are not
additive or when the method of ranking after alignment is not practicable. Alvo and
Cabilio (2005) extended his results to situations involving more general experimental
designs. In what follows, we begin with a review of the complete ranking case and then
consider the incomplete ranking situation.
Suppose that we have n judges (blocks) who rank t objects labeled (1, 2, . . . , t).
Each of the 1 ≤ i ≤ n judges is presented with 2 ≤ ki ≤ t objects which are ranked
independently by the judges. This ranking is a permutation of (1, 2, . . . , ki ) and in order
to indicate which objects are ranked by judge i, this is represented by

Ri = (Ri (o1 ) , Ri (o2 ) , . . . , Ri (oki )) ,

where the oj with 1 ≤ o1 < o2 < . . . < oki ≤ t are the labels of the actual objects being
ranked. Occasionally we may wish to represent this as a t-vector in which missing ranks
are represented by the symbol “_.” If ki = t, the ranking is said to be complete, otherwise
it is incomplete. We wish to test the hypothesis H0 : each judge, when presented with
the specified ki objects, picks the ranking at random from the space of ki ! permutations
of (1, 2, . . . , ki ).
Instead of basing our statistic directly on the ranks themselves, we may wish to
replace the ranks assigned by each judge by real valued score functions

a (j, ki ) , 1 ≤ j ≤ ki ≤ t.

In order to motivate the discussion, we begin in the next section with the complete case
where ki = t.

174
8. Optimal Rank Tests

8.3.1. Complete Block Designs


Consider the situation where each of the n independent rankings is complete and suppose
we wish to test

H0∗ : Each judge picks a complete ranking at random from the space of t!
permutations of (1, . . . , t)

For a given function a (j, t), 1 ≤ j ≤ t, and a ranking R define the vector of adjusted
scores



t
−1
a (R) = (a (R (1) , t)−at , a (R (2) , t)−at , . . . , a (R (t) , t) − at ) , where at = t a (r, t) .
r=1
(8.14)
Under H0∗ , E (a (R (j) , t)) = at , and the covariance matrix of a (R) is given by
 
t 1
Σ0 = σ I− J
2
(8.15)
t−1 0 t

where

t
−1
σ02 ≡ V ar (a (R (j) , t)) = t (a (r, t) − at )2 , (8.16)
r=1

I is the identity (t × t) matrix, and J is the (t × t) matrix of 1’s. A measure of similarity


between the complete rankings R1 and R2 is given by



t
A (R1 , R2 ) = a (R1 ) a (R2 ) = (a (R1 (j) , t) − at ) (a (R2 (j) , t) − at ) . (8.17)
j=1

Let the (t × t!) matrix T represent the collection of adjusted score vectors a (R) as R
ranges over all its t! possible values, and let f be the t! vector of frequencies of the
observed rankings. The (t! × t!) matrix T  T has components A (R1 , R2 ) with R1 and
R2 ranging over all t! permutations of (1, 2, . . . , t). With Ri , i = 1, . . . , n, representing
the observed rankings, and


n
Sn (j) = a (Ri (j) , t) ,
i=1

175
8. Optimal Rank Tests

let
S n = (Sn (1) , Sn (2) , . . . , Sn (t)) ,
so that T f = (S n − nat 1) , where 1 is the t−vector of 1’s. Proceeding as in (Alvo and
Cabilio (1991)), the proposed statistic is the quadratic form
    
t
−1   1 1 −1
n f (T T ) f = √ Tf √ Tf =n (Sn (j) − nat )2 . (8.18)
n n j=1

Large values of this statistic are inconsistent with H0 . Under H0∗ , as n → ∞, an


application of the central limit theorem for the multinomial shows that n−1/2 T f is
asymptotically normal with mean vector 0 and covariance matrix
 
1 1 1
T I − J T  = T T .
t! t! t!

Some manipulation shows this covariance to be equal to Σ0 . Consequently, since I− 1t J


is an idempotent of rank (t − 1) , as n → ∞ the statistic in (8.18) has asymptotically
the same distribution as t (t − 1)−1 σ02 χ2(t−1) , where χ2 is a random variable distributed
with a chi-square distribution with (t − 1) degrees of freedom. This leads to the same
statistic as defined in Hájek and Sidak (1967) and Sen (1968a), namely

t−1 
t
Qn =  t 2 (Sn (j) − nat )2 (8.19)
n r=1 (a (r, t) − at ) j=1

which, under H0∗ , as n → ∞ has asymptotically a chi-square distribution with (t − 1)


degrees of freedom.

8.3.2. Incomplete Block Designs


Consider the situation described earlier, in which the ith judge ranks 1 ≤ ki ≤ t objects.
In the development that follows, we will at first consider the situation in which a basic
design of b blocks is defined, and this basic design is replicated n times, so that a total
of nb judges rank the specified subsets of the t objects. The indicator function δi (j) is
either 1 or 0 depending on whether the object 1 ≤ j ≤ t is, or is not, ranked in block i.
The index set of objects ranked in block 1 ≤ h ≤ b of this basic design is denoted by

Ωh = {1 ≤ j ≤ t | δh (j) = 1} ,

and the basic design is defined by

Ω∗ = {Ωh | 1 ≤ h ≤ b} .

176
8. Optimal Rank Tests

Let rj = bh=1 δh (j) denote the total number of blocks in the basic design that include
b 
object j, and λjj  = h=1 δh (j)δh (j ) the number of such blocks which include both
objects 1 ≤ j = j  ≤ t, so that nrj judges rank object j, and nλjj  judges rank both
objects j and j  . For an incomplete ranking pattern Ωh , let

{Rh(s) , s = 1, 2, . . . , kh !}

represent the set of possible kh -rankings, that is the permutations of (1, 2, . . . , kh ) within
the specified incomplete pattern. For any such incomplete kh -ranking with a given
pattern of missing observations, associate a matrix of compatibility C h whose s =
1, 2, . . . , kh ! columns are indicators identifying which of the t! complete rankings in-
dexing its rows, are compatible with the particular kh -permutation Rh(s) . The analogue
of the adjusted matrix T , the collection of adjusted score vectors a(R) is given by

T ∗ = (T ∗1 | T ∗2 | . . . | T ∗b )

where
kh !
T ∗h = T C h.
t!

Denote by C Rh(s) the class of complete rankings compatible with the specified kh -
permutation Rh(s) indexed by column s of C h . Under H0 , the columns of T ∗h are the
conditional expected adjusted scores

E a(R) C Rh(s) .

This conditional expectation provides the appropriate weighting for scores in an in-
complete design. As Theorem 8.3 below indicates, this weighted score is given by the
following definition.
Definition 8.1. For a given score function a (j, t), 1 ≤ j ≤ t, if object j is ranked in a
given block 1≤ h ≤ b, and if Rh (j) = r, the weighted score is given by


t−k h +r    −1
∗ q−1 t−q t
a (r, kh ) = a (q, t) , r = 1, 2, . . . , kh . (8.20)
q=r
r−1 kh − r kh

Theorem 8.3. For an incomplete ranking pattern Ωh , 1≤ h ≤ b, with Rh(s) s =


1, 2, . . . , kh ! denoting the permutations of (1, 2, . . . , kh ) within the specified incomplete
pattern, the row j, column s element of T ∗h = kt!h ! T C h is

a∗ Rh(s) (j) , kh − at δh (j) .

Proof. For an incomplete ranking pattern Ωh , 1≤ h ≤ b, the row j, column s element


of T ∗h = kt!h ! T C h is the average of (a (R (j) , t) − at ) over all t!/kh ! complete rankings R

177
8. Optimal Rank Tests

compatible with Rh(s) . If object j is ranked in pattern Ωh , and if Rh (j) = r, then there
are exactly   
q−1 t−q
(t − kh )!
r − 1 kh − r
complete compatible rankings R, for which

R(j) = q, r ≤ q ≤ t − kh + r,

so that the average of such scores a (R (j) , t) is given by (8.20). If on the other hand ob-
ject j is not ranked in pattern Ωh , there are (t − 1)!/kh ! complete rankings R compatible
with Rh(s) for which R(j) = q, 1≤ q ≤ t, so that the sum of such scores is


t
(t − 1)!/kh ! a (q, t) ,
q=1

and thus the average score is at .

Note that if kh , the number of objects ranked is the same for each block, the weighted
scores in (8.20) are all the same function of r. Note further that

 
t−k+r
q−1

t−q
 −1
t
= 1.
q=r
r−1 k−r k

The matrix (T ∗ ) T ∗ contains measures of similarity between any two incomplete


rankings with patterns in Ω∗ , given as averages of the measures of similarity in (8.17)
between all the complete rankings compatible with each of the specified incomplete
rankings. Under H0 , for an incomplete ranking Rh ,


kh 
t
kh−1 ∗ ∗
a (r, kh ) = E (a (Rh (j) , kh )) = E (a (R (j) , t)) = at = t −1
a (r, t) ,
r=1 r=1

and the vector of adjusted weighted scores,

a∗ (Rh ) = [(a∗ (Rh (1) , kh ) − at ) δh (1) , (a∗ (Rh (2) , kh ) − at ) δh (2) , . . . ,


(a∗ (Rh (t) , kh ) − at ) δh (t)] (8.21)

has the covariance matrix Σh = γh2 Ah , where the (t × t) matrix is given by



kh −1
δh (j) on the diagonal
Ah = kh
 (8.22)
− kh δh (j) δh (j ) off the diagonal
1

178
8. Optimal Rank Tests

and
−1

kh
γh2 = (kh − 1) (a∗ (r, kh ) − at )2 .
r=1
b
Proceeding as in Alvo and Cabilio (1999), define the h=1 kh ! dimensional vector

f = (f 1 | f 2 | · · · | f b ) ,

where f h is the kh ! dimensional vector of the observed frequencies of each of the kh !


ranking permutations for the incomplete pattern Ωh , h = 1, 2, . . . , b. In order to extend
the notation for the weighted scores to all the nb rankings, define

a∗ (r, ki ) = a∗ (r, kh ) if i ≡ h (mod b) , b + 1 ≤ i ≤ nb.

Analogous to the complete case statistic the test statistic in this more general setting is
given by
  
−1  ∗ 1 ∗
∗ 1 ∗
n f (T T ) f = √ T f √ T f
n n
 2
 t 
nb
−1 ∗
= n (a (Ri (j) , ki ) − at ) δi (j) . (8.23)
j=1 i=1

Set

S nb = (Snb ∗
(1) , Snb ∗
(2) , . . . , Snb (t)) , r = (r1 , r2 , . . . , rt ) ,
where

nb

Snb (j) = a∗ (Ri (j) , ki ) δi (j) .
i=1

Since

nb
δi (j) = nrj
i=1

it follows that

nb
(S nb − nāt r) = a∗ (Ri ) .
i=1

The test statistic is then



t
−1 −1 ∗
Gn = n (S nb − nāt r) (S nb − nāt r) = n (Snb (j) − nrj at )2 . (8.24)
j=1

179
8. Optimal Rank Tests

Under H0 , n−1/2 T ∗ f has covariance matrix


b
Σ0 = γi2 Ai , (8.25)
i=1

so that an alternative form of the statistic in (8.24) is

n−1 (T ∗ f ) Σ− ∗
0T f = n
−1
(S nb − nāt r) Σ−
0 (S nb − nāt r) (8.26)

where Σ−
0 is a generalized inverse of Σ0 .

8.3.3. Special Cases: Wilcoxon Scores


When the score function is a (j, t) = j, the Wilcoxon score is simply the (unweighted)
Spearman correlation. In this case, if object j is ranked in block i, and if Ri (j) = r, the
weighted score in (8.20) becomes


t−k i +r    −1   −1
∗ q t−q t t+1 t t+1
a (r, ki ) = r =r =r , (8.27)
q=r
r ki − r ki t − ki ki ki + 1

This form is derived in Alvo and Cabilio (1991). Further at = (t + 1) /2, so that


b
Sb∗ (j) − rj at = (t + 1) (ki + 1)−1 (Ri (j) − (ki + 1) /2) δi (j) , (8.28)
i=1

and the test statistic Gn becomes

t  2  b   2
 t+1 
t  ki + 1
Sb∗ (j) − rj = (t + 1) 2
(ki + 1)−1 Ri (j) − δi (j) .
j=1
2 j=1 i=1
2
(8.29)
1 2 −1
In this case, γi2 = 12
(t + 1) ki (ki + 1) .

8.3.4. Other Score Functions


In the situation where the complete rankings Ri = (Ri (1) , Ri (2) , . . . , Ri (t)) arise from
independent random variables (Xi1 , Xi2 , . . . , Xit ) with absolutely continuous distribution

functions Fij (x) = Fi (x − τj ) , where τj = 0, and Fi (x) have continuous densities
∞ 2
fi (x) , for which −∞ fi (x) dx < ∞, Sen (1968b) shows that the asymptotic distribution
under local translation alternatives for test statistics Qn of the form in (8.19) is noncen-
tral chi-squared with (t − 1) degrees of freedom and a specified noncentrality parameter
Δn (a). An upper bound Δ0n to this noncentrality parameter over all possible choices of

180
8. Optimal Rank Tests

score functions a (r, t) is derived. For various distributions of (Xi1 , Xi2 , . . . , Xit ), iden-
tical over the blocks, Sen (1968b) derives the form of the optimal scores a (r, t) in the
sense that Δn (a) = Δ0n . In this sense, the Wilcoxon score statistic discussed above is
optimal in the case that the rankings result from complete samples from the logistic
distribution with density f (x) = e−x / (1 + e−x ) ; −∞ < x < ∞.
2

The score function for an incomplete ki -block is of the form


t−k i +r    −1
∗ q−1 t−q t
a (r, ki ) = a (q, t) = K (ki , t) a (r, ki ) , (8.30)
q=r
r − 1 ki − r ki

where K (ki , t) depends only on the number of objects ranked in the block. Other score
functions considered by Sen (1968) also have this property. In particular we have the
following.

(a) With the score a (1, t) = 1, a (t, t) = −1, a (r, t) = 0 otherwise, Qn is optimal when
sampling from the uniform distribution with density f (x) = 1; 0 ≤ x ≤ 1. Direct
substitution into (8.20) gives a∗ (r, ki ) = kti a (r, ki ) .

(b) With the score a (1, t) = 1, a (r, t) = 0 otherwise, Qn is optimal when sampling
from the exponential distribution with density f (x) = e−x ; 0 ≤ x < ∞. Again,
direct substitution into (8.20) gives a∗ (r, ki ) = kti a (r, ki ) .

(c) When sampling from the double exponential distribution with density f (x) = e−|x| ;
 t −t
−∞ < x < ∞, Qn associated with the score function a (r, t) = 1 − 2 r−1 i=0 i 2 is

shown to be optimal. In order to derive the form of a (r, ki ) in this case we make
use of the following lemma.

Lemma 8.2. Let F (x; n) = xi=0 ni pi (1 − p)n−i , 0 ≤ x ≤ n, be the cumulative bino-
mial distribution function. Then, for all 1 ≤ x ≤ m ≤ n


n−m+x    −1
i−1 n−i n
F (x − 1; m) = F (i − 1; n) . (8.31)
i=x
x−1 m−x m

Proof. See Alvo and Cabilio (2005) for the proof of this lemma.

Suppose that we have random variables Xi1 , . . . , Xit which may or may not be observ-
able but which underlie the ranks. We assume that the variables are independently
distributed with continuous cdf Fi1 (x) , . . . , Fit (x) respectively for i = 1, . . . , n. We
would like to test the null hypothesis that

Fi1 (x) = · · · = Fit (x) ≡ Fi (x) , i = 1, . . . , n,

181
8. Optimal Rank Tests

against the alternative that


t
Fij (x) = Fi (x − τj ) , j = 1, . . . , t; τj = 0.
j=1

From (8.20)
t−k +r q−1      −1

 i  t q−1 t−q t
a∗ (r, ki ) = 1 − 2 2−t ,
q=r i=0
i r−1 ki − r ki

 ki −ki
we have that the quantity in square brackets is equal to F (r − 1; ki ) = r−1 i=0 i 2 ,

so that a (r, ki ) = a (r, ki ) .
i ∗
The scores considered in this section have certain properties. Since ki−1 kr=1 a (r, ki ) =
at , it follows that for all such scores, at = K (k, t) āk . Further, if the design is such that
the number of objects ranked in each block is constant, that is ki = k for all i, then the
incomplete scores are equivalent to the complete case ones.

8.3.5. Asymptotic Distribution Under the Null Hypothesis


Consider the situation in which a basic design of b blocks is replicated n times. Under
H0 , as n → ∞, n−1/2 T ∗ f is asymptotically normal with mean vector 0 and covariance
matrix Σ0 given in (8.25), and thus the test statistic Gn is asymptotically distributed

as αi zi2 , where {zi } are independent identically distributed normal variates, and {αi }

are the eigenvalues of Σ0 . The matrix A = bi=1 Ai , with Ai given in (8.22) is the
information matrix of the design, and rank (Σ0 ) = rank (A) ≤ t − 1. If the design
is connected, that is λjj  ≥ 1, for all 1 ≤ j = j  ≤ t, so that every pair of objects is
compared, then rank (A) = t − 1. The critical values of the asymptotic distribution
are easily calculated using the methods in Jensen and Solomon (1972) (see Alvo and
Cabilio, 1995 for such an application.)
If the number of objects ranked in each block is the same, say ki = k, the weighted
scores are all a∗ (r, k). Similarly the variances σi2 are all equal to (k − 1) k −1 γ 2 , where

γ 2 = (k − 1)−1 kr=1 (a∗ (r, k) − at )2 . The covariance matrix Σ0 = γ 2 A,
 k−1
rj on the diagonal,
A= k (8.32)
− k λjj 
1
off the diagonal.

For many designs, including balanced incomplete blocks, cyclic, and group divisible
designs, the eigenvalues of A are found analytically. See Alvo and Cabilio (1999) for
details.

182
8. Optimal Rank Tests

The asymptotics and the null hypothesis may be recast in a different setting. Specif-
ically consider the situation in which random variables (Xi1 , Xi2 , . . . , Xit ) , which may
or may not be observable, underlie the rankings

Ri = (Ri (1) , Ri (2) , . . . , Ri (t)) , i = 1, . . . , b.

These random variables are assumed independent with absolutely continuous distribu-

tion functions Fij (x) = Fi (x − τj ) , where τj = 0, and Fi (x) have continuous densities
∞
fi (x) , for which −∞ fi2 (x) dx < ∞. The null hypothesis of random uniform selection
of rankings becomes H0 : τ = 0, where τ  = (τ1 , τ2 , . . . , τt ). If the asymptotics of interest
are simply that the number of blocks b becomes large, the definitions and notation used
earlier may be modified by setting n = 1 as appropriate. The test statistic may be
rewritten as
G∗b = (S b − āt r) Σ−
0 (S b − āt r) (8.33)
where Σ0 defined in (8.25) is the covariance matrix, under H0 , of

(S b − āt r) = (Sb∗ (1) − āt r1 , Sb∗ (2) − āt r2 , . . . , Sb∗ (t) − āt rt ) . (8.34)

As b → ∞, if the design is connected, G∗b has an asymptotic χ2 distribution with t − 1


degrees of freedom.

8.3.6. Asymptotic Distribution Under the Alternative


In block design Ωh with kh objects being ranked, h = 1, 2, . . . , b, let
(j)
phr = P (Rh (j) = r|δh (j) = 1)δh (j)

represent the probability that if object j is being ranked it will be assigned the rank
r, 1 ≤ r ≤ kh . Denote the mean score for this ranking pattern by


kh
a∗ (r, kh ) phr .
(j)
μh (j) = (8.35)
r=1

Note that according to our convention,

μi (j) = μh (j) , ki = kh if i ≡ h, (mod b) , rb + 1 ≤ i ≤ nb.

The variables


t 
sb
Us = cj [a∗ (Ri (j) , ki ) δi (j) − μi (j)] , s = 1, 2, . . . , n (8.36)
j=1 i=(s−1)b+1

183
8. Optimal Rank Tests

are independent with zero means, and with n−1 ns=1 E |Us |3 bounded. Thus, by the
Lindeberg-Feller Central Limit Theorem (Chapter 2.1) it follows that as n → ∞,

n−1/2 ns=1 Us is asymptotically normal for all c = (c1 , c2 , . . . , ct ) = (1, 1, . . . , 1). Thus
we have that n−1/2 (S nb − E (S nb )) has an asymptotic multivariate normal distribution.
Note that the expected values of the elements of S nb are


b

E (Snb (j)) = nμb (j) = n μh (j) δh (j) . (8.37)
h=1

Specializing the situation described in the previous section, assume that the ab-
solutely continuous distributions of the (Xi1 , Xi2 , . . . , Xit ) , are of the form Fij (x) =
∞
F (x − τj ) , with continuous density f (x) for which −∞ f 2 (x) dx < ∞. Consider the
alternatives
H1n : τ = τ n = n−1/2 θ, θ = (θ1 , θ2 , . . . , θt ) .
In order to investigate the asymptotic distribution of the test statistic

n−1 (S nb − nāt r) Σ−


0 (S nb − nāt r)

under such local translation alternatives, we use arguments and notation similar to those
in (Sen (1968b)). Thus let
  ∞
kh − 2
(h)
βs = F (x)r−2 (1 − F (x))kh −r f 2 (x) dx, s = 0, 1, . . . , kh − 2
s −∞

(h) (h)
where β−1 = βkh −1 = 0. By definition, for all j ∈ Ωh ,

(j)
 ∞ 
phr = P Xs1 ≤ x, . . . , Xsr−1 ≤ x, Xsr+1 > x, . . . , Xskh > x|Xj = x dFj (x) ,
Sj −∞

(8.38)
where the summation extends over all possible choices of (s1 , . . . , sr−1 ) from Ωh \ {j} ,
with (sr+1 , .., skh ) the complementary set. Use of the distributional assumptions, inde-
pendence, and Taylor series expansion and some manipulation shows that the probability
in (8.38) can be written as
(h) 
phr = kh−1 + n−1/2 kh θj − θh βr−2 − βr−1 + o n−1/2 ,
(j) (h)


where θh = kh−1 θj , and the summation is over all j ∈ Ωh . Hence we see that μb (j)
defined in (8.35) is related to the null mean through

μb (j) − rj āt = η (j, a∗ ) + o n−1/2 , (8.39)

184
8. Optimal Rank Tests

where
 

b

kh 
η (j, a∗ ) = n−1/2 kh θj − θh δh (j) a∗ (r, kh )
(h) (h)
βr−2 − βr−1 . (8.40)
h=1 r=1

Turning to the covariance matrix, define


(j,j  )
ph,rs = P (Rh (j) = r, Rh (j  ) = s|δh (j) δh (j  ) = 1)δh (j) δh (j  )

which can be shown to be expressible as

(j,j  ) 1
ph,rs = + o n−1/2 . (8.41)
kh (kh − 1)

The covariance matrix of the vector of scores

[(a∗ (Rh (1) , kh ) − μh (1)) δh (1) , (a∗ (Rh (2) , kh )

− μh (2)) δh (2) , . . . , (a∗ (Rh (t) , kh ) − μh (t)) δh (t)] (8.42)

can be written as  
kh −1 −1/2
γh2 +o n δh (j)
kh
on the diagonal and  
1 −1/2
−γh2 +o n δh (j) δh (j  )
kh
off the diagonal. It follows that

Cov n−1/2 S nb − Σ0 → 0 as n → ∞,

where Σ0 is defined in (8.25). Combining this result with the asymptotic normality of
n−1/2 Snb it follows that if the basic design is connected, the statistic

n−1 (S nb − nāt r) Σ−


0 (S nb − nāt r)

has for n → ∞, a noncentral chi-squared distribution with t − 1 degrees of freedom and


noncentrality parameter

∗ 
[η (1, a∗ ) , . . . , η (t, a∗ )] Σ− ∗
0 [η (1, a ) , . . . , η (t, a )]

where η (j, a∗ ) is defined in (8.40).

185
8. Optimal Rank Tests

The form of this noncentrality parameter is simplified if kh = k for all h = 1, 2, . . . , b.


In such a case, it becomes
 2
k
a∗ (r, k) (βr−2 − βr−1 )
r=1
k 2 (k − 1) k 2
Θ A− Θ
r=1 (a (r, k) − at )


where Θ is the vector with components bh=1 θj − θh δh (j) , j = 1, 2, .., t, and A is
defined in (8.32). This parameter may be written as a multiple of a correlation coefficient
which is maximized when a∗ (r, k) − at = c (βr−2 − βr−1 ). This relationship leads to the
same conclusions as in Sen (1968b), which is that the score functions detailed in Section 4
are optimal for the stated underlying distributions.

8.4. Exercises
Exercise 8.1. Let X have an exponential distribution with mean 1. Find the optimal
score statistic for the scale alternative.

Exercise 8.2. Let X have a Cauchy distribution. Find the optimal score statistic for
the location alternative.

186
9. Efficiency
In this chapter, we consider the asymptotic efficiency of tests which requires knowledge
of the distribution of the test statistics under both the null and the alternative hypothe-
ses. In the usual cases such as for the sign and the Wilcoxon tests, the calculations
are straightforward. However, the calculations become more complicated in the multi-
sample situations and for these, we appeal to Le Cam’s lemmas. This is illustrated in
the case of test statistics involving both the Spearman and Hamming distances. The
smooth embedding approach is useful in that for a given problem, it leads to test statis-
tics whose power function can be determined. The latter is then assessed against the
power function of the optimal test statistic derived from Hoeffding’s formula for any
given underlying distribution of the data.

9.1. Pitman Efficiency


Suppose that one is interested in testing a null hypothesis H0 against an alternative H1 .
Consider two competing tests Tn,1 , Tn,2 both of which are at level α and are consistent,
that is, the power function converges to 1 as the sample size gets large. We may choose
between them only if we can define a suitable criterion that takes into account the
size as well as the power function of the test. There are a number of different criteria
as detailed in Chapter 10 of Serfling (2009). Here, we will only describe the criteria
proposed by Pitman which is defined for local alternatives close to the null hypothesis.
The criteria provides a simple measure of the comparison between two tests but it is not
as informative as the power function would be since the latter looks at a full range of
alternatives.
Let X1 , . . . , Xn be a random sample from some cdf F having density f (x; θ) for
some parameter θ. Let Ω0 , Ω1 denote the spaces where θ takes its values under the null
and alternative hypotheses respectively. The power function of a test is defined to be
the probability of rejecting the null hypothesis and is given by

βn (Tn ; θ) = Pθ (Tn rejects H0 ) .

© Springer Nature Switzerland AG 2018 187


M. Alvo, P. L. H. Yu, A Parametric Approach to Nonparametric Statistics,
Springer Series in the Data Sciences, https://doi.org/10.1007/978-3-319-94153-0_9
9. Efficiency

The size of the test is defined to be

αn = sup βn (Tn ; θ) .
θΩ0

In the specific situation where


H0 : θ = 0
against
H1 : θ > 0.
we shall be concerned with smooth local alternatives of the form θn = √hn . These
alternatives make it more difficult for a test to choose between the hypotheses as n gets
large. We shall suppose that for both tests the following three conditions are satisfied:

(i) the test statistic Tn is asymptotically normal under Pθn


√ 
n (Tn − μ (θn ))
Pθn ≤ x −→ Φ (x)
σ (θn )

σ 2 (θn )
where it is convenient to let μ (θn ) denote the mean of Tn and n
denote the
variance of Tn .

(ii) μ (θn ) is differentiable at 0, and σ (θn ) −→ σ (0) as θn → 0 and

(iii) the test rejects for large values of the test statistic, that is

n (Tn − μ (0))
> zα
σ (0)

where zα is the upper 100 (1 − α) % point of a standard normal distribution.

Under these conditions, the corresponding power function converges to


√ √ 
n (Tn − μ (θn )) σ (0) n (μ (θn ) − μ (0))
βn (θn ) = Pθn ≥ zα −
σ (θn ) σ (θn ) σ (θn )
√ 
n (Tn − μ (θn )) σ (0) h (μ (θn ) − μ (0))
= Pθn ≥ zα −
σ (θn ) σ (θn ) σ (θn ) θn
−→ Φ (hc − zα ) ,

where  
1 (μ (θn ) − μ (0)) μ (0)
c = lim = .
n→∞ σ (θn ) θn σ (0)
The quantity c is called the slope of the test. The larger the slope, the more powerful
the test. A comparison of two tests Tn,1 , Tn,2 may then be made in terms of the ratio of

188
9. Efficiency

the slopes denoted by c1 , c2 , respectively


     
c1 μ1 (0) μ2 (0)
= / . (9.1)
c2 σ1 (0) σ2 (0)
c1
Hence, Tn,1 is preferred to Tn,2 if c2
> 1.

Example 9.1 (One-Sample Tests: Sign Test, t-Test, Wilcoxon Signed Rank Test).
We first calculate the slope of the sign test. Suppose that we have a random sample
of size n from some distribution having absolutely continuous cumulative distribution
function F (x − θ) which is symmetric around its median θ. We would like to test the
null hypothesis H0 : θ = 0 and the one-sided alternative H1 : θ > 0. Let Tn,1 be the
proportion of Xi > 0. The sign test rejects the null hypothesis for large values of Tn,1 . Set

p = P (X > 0) = 1 − F (−θ)

Then, Tn,1 has a binomial distribution with parameters n and probability of success p.
Hence,

μ (θ) = Eθ [Tn,1 ] = p,
p (1 − p)
V arθ [Tn,1 ] = .
n
It follows that σ 2 (θ) = p (1 − p) and the slope is

μ (0) 2f (0)
= , (9.2)
σ (0) σ

where f (x) = F  (x).


Suppose now we consider the usual t-test for H0 vs H1 which rejects the null hy-
pothesis for large values of the sample mean Tn,2 = X̄n . Hence,

μ (θ) = E [Tn,2 ]

= xdF (x − θ)
= θ

Since the test statistic is given by



nX̄n
,
Sn

189
9. Efficiency

where Sn2 represents the sample variance, we see that as n → ∞



Sn → σ =
2 2
x2 dF (x) .

It follows that
μ (0) 1
= . (9.3)
σ (0) σ
Hence, when f (x) is the standard normal density, the asymptotic relative efficiency of
the sign test to the t-test is
 2
2
2
(2f (0)) = √ = 0.637.

The Wilcoxon signed rank statistic rejects for large values of


n
Tn,3 = Ri+ I (Xi > 0) ,
i=1

where as in Section 5.2, Ri+ is the rank of |Xi | among the {|Xi | , i = 1, . . . , n} . It follows
that under the null hypothesis

n (n + 1)
E0 [Tn,3 ] = ,
4
n (n + 1) (2n + 1)
V ar0 [Tn,3 ] = .
24
The calculations of the mean and variance under the alternative for the Wilcoxon signed
rank statistic are more involved and are given by the following theorem.

Theorem 9.1. Let

p1 = P (X > 0)
p2 = P (X1 + X2 > 0)
p3 = P (X1 + X2 > 0, X1 > 0)
p4 = P (X1 + X2 > 0, X1 + X3 > 0) .

Then, we have,
n (n − 1)
E [Tn,3 ] = np1 + p2
2

190
9. Efficiency

and
n (n − 1)
V ar [Tn,3 ] = np1 (1 − p1 ) + p2 (1 − p2 )
2
+2n (n − 1) (p3 − p1 p2 ) + n (n − 1) (n − 2) p4 − p22

(p2 +p21 )
Proof. See page 47 of Hettmansperger (1994). Here p3 = 2
.

The probabilities in Theorem 9.1 can be approximated for values of θ close to 0


so that

p1 = 1 − F (−θ)
≈ 1 − [F (0) − θf (0)]
1
= − θf (0) .
2
Also,

p4 = P (X1 + X2 > 0, X1 + X3 > 0)


 2
= 1 − F (−x) f (x) dx
1
=
3
and

p2 = P (X1 + X2 > 0)
≈ 1 − F ∗ (−2θ)
1
= + 2θf ∗ (0)
2
where F ∗ is the convolution distribution and f ∗ is its density. Using the symmetry of f
whereby f (x) = f (−x), we have that
 ∞

f (0) = f 2 (x) dx.
−∞

It follows that for small values of θ


n (n + 1) (2n + 1)
V ar0 [Tn,3 ] ≈ .
24

191
9. Efficiency

In order to compute the slope for the Wilcoxon signed rank test, we note that

n (n − 1)
μ (θ) = E [Tn,3 ] = np1 + p2
2
so that  ∞

μ (0) = nf (0) + n (n − 1) f 2 (x) dx
−∞

and 
μ (0) √ ∞
= 12 f 2 (x) dx. (9.4)
σ (0) −∞

The calculations of the slopes in (9.2), (9.3), and (9.4) enable us to compute the
asymptotic relative efficiencies.

Example 9.2 (Two-Sample Tests: Mann-Whitney-Wilcoxon Test and t-Test). Suppose


that we have two independent random samples of sizes n and m respectively from dis-
tributions F (x) ,G(x − Δ) having densities f, g. Recalling the Mann-Whitney-Wilcoxon
test in Section 5.3.2, it was seen that the test statistic was the counting form

1 
m n
Tm,n = I (Xi < Yj )
mn j=1 i=1

which rejects the null hypothesis for large values. Then

μ (Δ) = E [Tm,n ]
= P (X < Y )
 ∞
= F (y) f (y − Δ) dy
−∞
 ∞
= F (y + Δ) dF (y)
−∞

and provided we can differentiate under the integral sign,


 ∞

μ (0) = f 2 (y) dy.
−∞

The variance of Tm,n , given from Theorem 5.1, is asymptotically

1  
V ar [Tm,n ] = mnq 1 (1 − q 1 ) + mn (n − 1) (q 2 − q 2
1 ) + mn (m − 1) (q 3 − q 2
1 )
(mn)2
 
≈ o (1) + m−1 (q2 − q12 ) + n−1 (q3 − q12 )

192
9. Efficiency

so that for m
m+n
→ λ > 0,

(m + n) V ar [Tm,n ] ≈ λ−1 (q2 − q12 ) + (1 − λ)−1 (q3 − q12 )


1
= ,
12λ (1 − λ)

using similar arguments following Theorem 9.1. It follows that the slope is given by
 ∞
μ (0) -
= 12λ (1 − λ) f 2 (x) dx. (9.5)
σ (0) −∞

By way of comparison, we may also compute the slope for the two-sample t-test
using the statistic which rejects the null hypothesis for large values. It is given by the

Table 9.1.: Relative efficiency of the Wilcoxon two-sample test relative to the t-test
Distribution Relative efficiency
Normal 3/π
Logistic π 2 /9
Uniform 1

ratio  
Ȳm − X̄n
√ ,
Sn,m m−1 + n−1
2
where X̄n , Ȳm represent the sample means and Sn,m represent the pooled sample variance
2
estimate of the common variance σ . It is straightforward to show that the slope of the
two-sample t-test is
-
μ (0) λ (1 − λ)
= . (9.6)
σ (0) σ

We record in Table 9.1 the relative efficiency of the Wilcoxon test relative to the
t-test for various underlying distributions.

9.2. Making Use of Le Cam’s Lemmas


The calculation of the asymptotic distribution under the alternative is not always as
straightforward as in the previous section. In what follows we shall consider examples
where it is necessary to calculate the asymptotic relative efficiency of tests by making use
of Le Cam’s contiguity concepts and lemmas. We begin with a simple example obtained
from Section 14.11 of van der Vaart (2007) involving the median test.

193
9. Efficiency

Example 9.3 (Median Test). Consider the two-sample test in Example 9.2 with m
observations in the first sample and n in the second. Set N = m + n. The median test
rejects the null hypothesis that the medians in the two samples are the same for large
values of  
1 
N
n+1
TN = I Ri ≤
N i=m+1 2

where R1 , . . . , RN are the ranks of the complete data and I (A) is the indicator function
of the set A. Under the null hypothesis,
  
√ n  1  m
1
N TN − = √ −n I F (Xi ) ≤
2N N N i=1
2
n  
1
+m I F (Yi ) ≤ Bigg) + oP (1) (9.7)
j=1
2
 
L λ (1 − λ)

→ N 0, (9.8)
4

since the left-hand side of (9.7) can be expressed as a linear rank statistic. The right-
hand side of (9.7) is the projection (see Exercise 3.8). On the other hand, for alternatives
θN = √hN , the log likelihood ratio given by
# # √
h 1 − λ  g
n
f (Xi ) g (Yj − θN ) h2 (1 − λ) Ig
log # # =− √ (Yi ) − + oP (1) (9.9)
f (Xi ) g (Yj ) n i=1
g 2

and hence is asymptotically normal. Moreover the joint distribution of the linear parts
of the right-hand sides of (9.7) and (9.9) is multivariate normal. Slutsky’s lemma im-
plies a similar result for the right-hand sides. Hence by Le Cam’s third lemma, under
alternatives θN = √hN ,

 
√ n  L λ (1 − λ)
N Tn − −
→N τ (h) , ,
2N 4

where   
f  (y)
τ (h) = −hλ (1 − λ) dF (y)
F (y)≤1/2 f (y)
is the asymptotic covariance. The slope of the median test is then given by
  
- 1/2
f  (v)
−2 λ (1 − λ) F −1 (v) dv. (9.10)
0 f (v)

194
9. Efficiency

9.2.1. Asymptotic Distributions Under the Alternative in the


General Multi-Sample Problem: The Spearman Case
In this section, we derive the asymptotic efficiency for the Spearman statistic in the
general multi-sample problem with ordered alternatives from Section 6.1.2. Let f (x)
be the corresponding density of F (x) and let ϕ (v) be a square integrable function on
(0, 1) . Assume

F −1 (v) = inf {x : F (x ≥ v)}

 ∞  2
f  (x)
I (f ) = f (x) dx < ∞ (9.11)
−∞ f (x)

f  (F −1 (v))
ϕ (v, f ) = ,0 < v < 1
f (F −1 (v))

Let F1 (x) , . . . , Fr (x) be r continuous distribution functions. Suppose we wish to test



H0 : Fr (x) = . . . = F1 (x) = F x − d¯

against the alternative

H1 : Fk (x) = Fk (x − Δk ) , 1 ≤ k ≤ r,

where
1 
Nr
Δ1 < . . . < Δr , ¯
d= di
Nr i=1
and
di = Δk , Nk−1 < i ≤ Nk . (9.12)
We shall also assume that as min {n1 , . . . , nr } → ∞, the following regularity conditions
hold under the alternative

Nr1/2 Δk → δk (9.13)

2
max di − d¯ → 0 (9.14)
1≤i≤Nr


Nr
2
I (f ) di − d¯ → b2 , a finite constant. (9.15)
i=1

Notation The notation T1 ∼ T2 indicates that the statistics T1 , T2 are asymptotically.

195
9. Efficiency

We may determine the locally most powerful test for this problem by using Hoeffding’s
lemma 8.1. In fact, provided  ∞
|f  (x)| dx < ∞,
−∞

the locally most powerful test is given by the statistic


Nr

T0 = − di − d¯ ϕ Xi − d,
¯f (9.16)
i=1


Nr
f  Xi − d¯
= − di − d¯ (9.17)
i=1
f Xi − d¯

which under H1 has an asymptotic normal distribution with mean 0 and variance


Nr
2
b2 = I (f ) di − d¯ . (9.18)
i=1

We shall make use of the following two theorems.


Theorem 9.2 (Hájek and Sidak (1967), Theorem a V.1.4). Let {Vi }n1 be i.i.d. uniform
random variables on the interval (0, 1) . Suppose ϕ (v) , 0 < v < 1 is a square integrable
function such that  1
(ϕ (v))2 dv < ∞
0

and put
aϕn (i) = E {ϕ (V1 ) |π (1) = i} , 1 ≤ i ≤ n.
Then
lim E {aϕn (π (1)) − ϕ (V1 )}2 = 0.
n→∞

Theorem 9.3 (Hájek and Sidak (1967), Theorem VI.2.4). Let


n
Sn = (ck − c̄) an (μ (i)) ,
i=1

where the scores {an (i)} satisfy


 1
lim {an (1 + [vn]) − ϕ (v)}2 dv = 0,
n→∞ 0

and the {ck } satisfy Noether’s condition. Then under the regression alternatives,

Sn − m n L

→ N (0, 1) ,
σn

196
9. Efficiency

where
 
n
1
mn = (ck − c̄) di − d¯ ϕ (v) ϕ (v, f ) dv
i=1 0


n  1
2
σn2 = (ck − c̄) (ϕ (v) − ϕ̄)2 dv.
i=1 0

Corollary 9.1. Consider the Spearman statistic for the multi-sample ordered location
problem
 μ (i)
Sr = c (i)
Nr + 1
where for 1 ≤ k ≤ r
c (i) = Nk−1 + Nk , Nk+1 < i ≤ Nk .
Then as min {n1 , . . . , nr } → ∞, Sr is, under the alternative, asymptotically normal with
mean and variance given respectively by

Nr2 
r N 1
mr = + (ck − c̄) di − d¯ vϕ (v, f ) dv
2 i=1 0

Nr3 
σr2 = wk Wk Wk−1
12
with
nk  k
→ w k , Wk = wi .
Nr i=1

Proof. This is a direct application of Theorem 9.3.

We are now in a position to compute the asymptotic efficiency of the Spearman


statistic in the multi-sample ordered problem. We shall show that the asymptotic effi-
ciency is given by the ratio
( 1 )2
r
wk δk (Wk−1 + Wk − 1) 0 vϕ (v, f ) dv
k=1
ARE = + 1 r , ( r r 2
) (9.19)
I (f ) 12 k=1 wk Wk−1 Wk k=1 wk (δk − k=1 wk δk )

In fact, from Corollary 9.1,


Nr

r
¯
(ck − c̄) di − d = Nr
3/2
wk δk (Wk + Wk−1 − 1)
i=1 i=1

197
9. Efficiency

and

Nr
2 
Nr  2
¯
di − d = w k δk − w k δk
i=1 i=1

Hence, the asymptotic power is given by

⎛ r  ⎞
  1
(Wk−1 + Wk − 1) 0 vϕ (v, f ) dv
Sr − E0 Sr k=1 wk δk
lim P √ > kα = 1 − Φ ⎝kα −
 ⎠ (9.20)
min{ni }→∞ V ar0 Sr 1 r
12 k=1 w k W k−1 W k

On the other hand, the asymptotic power of the locally most powerful test defined
by the optimal score function is given by
⎧  2 ⎫
⎨r 
r ⎬
I (f ) w k δk − w k δk .
⎩ ⎭
k=1 k=1

In Table 9.2, we record various integrals which enable us to compute the efficiencies.
To illustrate the calculation, note that for the standard normal, using a change of variable
and integration by parts,
 1/2  1/2

uϕ (u) du = uΦ−1 (u) du


0 0
 0
1 v2
= √ Φ (v) ve− 2 dv
2π −∞
 0  0  2
1 − u2 1 − v2
= Φ (u) √ e 2 + √ e 2 dv
2π −∞ −∞ 2π
1 1
= − √ + √ .
2 2π 4 π

A similar calculation then shows


 1  1/2  1
uϕ (u) du = uϕ (u) du +
uϕ (u) du
0 0 1/2
   
1 1 1 1
= = − √ + √ + √ + √
2 2π 4 π 2 2π 4 π
1
= √ .
2 π

198
9. Efficiency

9.2.2. Asymptotic Distributions Under the Alternative in the


General Multi-Sample Problem: The Hamming Case
Theorem 9.3 provides the asymptotic distribution under the alternative for simple lin-
ear rank statistics in the multi-sample case with ordered alternatives. For other test
statistics such as the one generated by Hamming distance, we need to consider general-
ized rank statistics. To this end, we considerthe following theorems. First, we exhibit

Table 9.2.: Computations of integrals


Double
Normal Logistic
exponential
I(f ) σ −2 1 1
1 3
√1 1 1
1/2 ϕ (u) du
1 σ 2π 2 4

1 uϕ (u) du
√1 + 4√1 π ≈ 0.3405 3 5
 1//22 2 2π 8 24
uϕ (u) du − 2√12π + 4√1 π ≈ −0.0585 − 18 − 241
 01 2
√1 + 1
arctan 1/√2 + π ≈ 0.2961 7 17
1 u ϕ (u) du
3/2
4 2π 2(π) 2
 1//22 2 √
24 96
u ϕ (u) du − 4√12π − 1 3/2 arctan1/ 2 + π2 + 2√1 π ≈ −0.0141 − 241
− 961
0 2(π)
v2
1
Normal: f (x) = √2πσ e− 2σ2 , −∞ < x < ∞
Double exponential: f (x) = 12 e−|x| , −∞ < x < ∞
−2
Logistic: f (x) = e−x (1 + e−x ) , −∞ < x < ∞.

the asymptotic equivalence of a linear rank statistic to a sum of independent random


variables in the multi-sample case.
Theorem 9.4. Let {Vi }n1 be i.i.d. uniform random variables on the interval (0, 1) . Set


Nr

T1 = di − d¯ ϕ (Vi , f ) ,
i=1
  
Nr
μ (i)
T2 = di − d¯ ϕ ,f .
i=1
Nr + 1

Then under Theorem 9.2


E0 (T1 − T2 )2 → 0
as min {n1 , . . . , nr } → ∞ and hence T1 ∼ T2 under H0 .
Proof. See Alvo and Pan (1997).
Theorem 9.5. Let

Nr

T1 = di − d¯ ϕ (Vi , f )
i=1

199
9. Efficiency

and

n
f (x − di )
log Ln = log
i=1
f s − d¯
Then  
b2 P
log Ln − T1 + −
→0
2
2

as min {n1 , . . . , nr } → ∞. Moreover, log Ln is asymptotically N − b2 , b2 .

Proof. The asymptotic normality of T1 follows from the Lindeberg condition. See The-
orem VI.2.1 of Hájek and Sidak (1967) for details of the proof.

The next theorem provides the general result which is useful for obtaining the non-
null distribution of test statistics involving more general distance functions such as that
of Hamming.

Theorem 9.6. Consider a generalized linear rank statistic


Nr
T = aiπ(i) ,
i=1

and suppose that under H0


T ∼ N (0, 1) .
Let
d (i, j) = aij − āi. − ā.j + ā..
with
max d (i, j) ≈ O (Nrp ) , p < 0
1≤i,j≤Nr

Assume as well I (f ) < ∞ holds. Then under the alternative H1 the asymptotic distri-
bution of T is N (σ12 , 1) where
σ12 = E0 [T T2 ]
and  

Nr
π (i)
T2 = di − d¯ ϕ ,f .
i=1
Nr + 1

Proof. The proof follows closely the proof in Theorem VI.2.4 of Hájek and Sidak (1967).
Recall


Nr

T1 = di − d¯ ϕ (Vi , f ) .
i=1

200
9. Efficiency

Under H0 from Theorem 9.4, T1 and T2 are asymptotically equivalent. Under H0 , The-
orem 9.5 then implies
b2
log LNr ∼ T1 − .
2
It follows that under H0
   
b2 b2
(T, log LNr ) ∼ T, T1 − ∼ T, T2 − .
2 2

We can see that T2 ∼ N (0, b2 ) and hence if


   
0 1 σ12
(T, T2 ) ∼ N2 , ,
0 σ12 b2

then from Le Cam’s third lemma, it will follow that under the alternative

T ∼ N (σ12 , 1) .

In view of the Cramér-Wold device (Section 2.1.2), it remains to show that for
arbitrary constants c1 , c2
c1 T + c2 T2 ∼ N ormal.
For that purpose we make use of Hoeffding’s combinatorial (see Section 3.3) central limit
theorem. Let  

j
¯
aij = c1 aij + c2 di − d ϕ ,f .
Nr + 1
Then

Nr
c1 T + c2 T2 = a∗iπ(i) .
i=1

Put

d∗ (i, j) = a∗ij − ā∗i. − ā∗.j + ā∗..


   
j
= c1 d (i, j) + c2 di − d ϕ ¯ ,f − ϕ̄ (., f ) .
Nr + 1

Since ϕ (v, f ) is an integrable function, there exists a constant M > 0 such that
|ϕ (v, f ) | ≤ M, a.s.. Also,
  
 n Δ 
|di − d| ≤ max Δk − 
¯ k k
1≤k≤r Nr 
  
 tδk 
≈ Nr−1/2
max δk −
1≤k≤r Nr 

≈ O Nr−1/2 .

201
9. Efficiency

Hence
    
 j 
∗  ¯
|d (i, j)| ≤ |c1 d (i, j)| + c2 di − d ϕ , f − ϕ̄ (., f ) 
N +1
; r ;
≤ |c1 | max |d (i, j)| + |c2 M | di − d¯;
;
1≤i,j≤Nr

≈ O (Nrp ) + O Nr−1/2 ≈ o (1) .

On the other hand,

1  ∗
V ar0 {c1 T + c2 T2 } = (d (i, j))2
Nr − 1
→ c2

and
max1≤i,j≤Nr (d∗ (i, j))2
 ∗ →0
1
Nr −1
(d (i, j))2
This completes the proof.

Applying Theorem 9.6 with p = − 12 , we see that Hamming’s statistic Hr is, under
the alternative, asymptotically normal with mean

mH = 1 + E0 [Hr T2 ]

and variance, as in (6.5)


2 r−1
σH =
Nr − 1
The calculation of the mean is shown in the next lemma.

Lemma 9.1. For the Hamming distance,


r  Wk
−1/2
E0 [Hr T2 ] ≈ N δk [ϕ (v, f ) − ϕ̄ (f )] dv
k=1 Wk−1

Proof. We may write


   
μ (i)
E0 [Hr T2 ] = E0 aiμ(i) ¯
di − d ϕ ,f
Nr + 1
N    
1 r 
Nk r
j
= aij di − d¯ ϕ , f − ϕ̄ (., f )
Nr − 1 k=1 i=N +1 j=1 Nr + 1
k−1
⎧ ⎫
 ⎨     ⎬
1
r Nkr
1 j
= nk Δk − d¯ ϕ , f − ϕ̄ (., f )
Nr − 1 k=1 ⎩j=N +1 nk Nr + 1 ⎭
k−1

202
9. Efficiency
⎧ ⎫
 ⎨     ⎬
1
r
Nkr
j
= Δk − d¯ ϕ , f − ϕ̄ (., f )
Nr − 1 k=1 ⎩ N r +1 ⎭
j=Nk−1 +1
 

r 
r Wk
≈ N − /2
1
δk − w k δk [ϕ (v, f ) − ϕ̄ (f )] dv
k=1 j=1 Wk−1

r  Wk
= N − /2
1
δk [ϕ (v, f ) − ϕ̄ (f )] dv
k=1 Wk−1

We may specialize these results for the two-sample case. In particular, we may
calculate for r = 2 and sample sizes n1 , n2 with n1n+n
1
2
→ λ,
 −1/2   
bI −1/2(f ) λ 1 λ
mH = 1 + ϕ (v, f ) dv − ϕ (v, f ) dv ,
n1 + n2 1−λ λ 0

and
2 1
σH = .
n1 + n2 − 1
It now follows that the asymptotic relative efficiency for the Hamming distance is
given by ( )2
r  Wk
k=1 δ k Wk−1
[ϕ (v, f ) − ϕ̄ (f )] dv
ARE = ( r ). (9.21)
r 2
(r − 1) I (f ) w
k=1 k (δ k − w
k=1 k k δ )

9.3. Asymptotic Efficiency in the Unordered


Multi-Sample Test
In the unordered multi-sample problem, the test statistic for the Spearman distance has
an asymptotic χ2 distribution under both the null and alternative hypothesis. For that
reason, it is necessary to redefine the notion of efficiency. Suppose that we are interested
in testing
H0 : θ = 0,
against
H1 : θ = 0.
Consider two test statistics Q1 , Q2 for which

L χ2k under H0
Qi −

χ2k (δi2 ) under H1

203
9. Efficiency

where δi2 is a noncentrality parameter for i = 1, 2 respectively. Suppose that Q1 is the


locally most powerful test. Then we may define the asymptotic efficiency of Q2 relative
to Q1 to be the ratio  2
δ2
e= .
δ1
Suppose instead that we have a situation where the degrees of freedom under the
null and alternative hypotheses are different. In that case, we may use Theorem VI.4.6
of Hájek and Sidak (1967) which states that when the noncentrality parameter δ → 0,
then the power function under the alternative is given by
 2   −1
2 −(n+2)/2
χα,n n+2 2 n/2
P χ2k δ ≥ χα,n − α ≈ δ2
2
exp − Γ χα,n .
2 2

On the other hand, suppose that the maximum asymptotic power can be reached by the
test based on χ2r−1 (b2 ) . Hence the asymptotic relative efficiency can be defined to be
approximately
 2  2 n/2
P χ2k (δ 2 ) ≥ χ2α,n − α χα,r−1 − χ2α,n Γ r+2 χα,n δ
2 ≈ 2 (r−1−n)/2
exp 2
n+2 .
P χk (b2 ) ≥ χ2α,n − α 2 Γ 2 χ2α,r−1
(r−1)/2 b2
(9.22)
In the case where r = 2,
b2 = I (f ) w1 w2 (δ1 − δ2 )2 .

Theorem 9.7. The Spearman statistic for the multi-sample unordered problem is asymp-
totically chi-square χ2r−1 (δS2 ) with noncentrality parameter
 1 2
δS2 = 12b 2
vϕ (v, f ) dv /I (f ) .
0

Proof. See Alvo and Pan (1997).

Theorem 9.8. The Hamming statistic for the multi-sample unordered problem is asymp-
totically chi-square χ2(r−1)2 (δH
2
) with noncentrality parameter

2 −1
δH = E [μ̄] ΣH E [μ̄] ,

where μ̄ was defined in Theorem 6.2.

Proof. See Alvo and Pan (1997).

Theorems 9.7 and 9.8 permit us to calculate the asymptotic relative efficiencies in
the unordered case. It can be shown that these are the same as for the ordered situation
and hence we have the same results as above.

204
9. Efficiency

9.4. Exercises
Exercise 9.1. Show that for the two-sample problem the asymptotic relative efficiency
for the normal density,
 
1 x2
f (x) = √ exp − 2 , −∞ < x < ∞, I (f ) = σ −2
2πσ 2σ

is ARE = 3
π
≈ 0.9554.

Exercise 9.2. Show that for the two-sample problem the asymptotic relative efficiency
for the double exponential,

1
f (x) = exp (− |x|) , −∞ < x < ∞, I (f ) = 1
2
3
is ARE = 4
= 0.75.

Example 9.4. Show that for the two-sample problem the asymptotic relative efficiency
for the logistic distribution,
−2
f (x) = e−x 1 + e−x , −∞ < x < ∞, I (f ) = 1/3

is ARE = 1.

205
Part III.

Selected Applications

207
10. Multiple Change-Point
Problems

10.1. Introduction
In the classical formulation of the single change-point problem, there is a sequence
X1 , . . . , Xn of independent continuous random variables such that the Xi for i ≤ τ have
a common distribution function F1 (x) and those for i > τ a common distribution F2 (x).
It is of interest to test the hypothesis of “no change,” i.e., τ = n against the alternative
of a change, 1 ≤ τ < n.
We begin by formulating the problem in terms of a parametric framework and study
the properties of the new model. We then construct a composite likelihood function
which permits us to conduct tests of hypotheses based on a score statistic to assess
the significance of the change-points. We demonstrate the consistency of the estimated
change-point locations and present a binary segmentation algorithm to search for the
multiple change-points. We then report on a number of simulation experiments in order
to compare the performance of the proposed method with other methods in the literature.
We apply the new method to detect the DNA copy number alterations in a human
genomic data set and to identify the change-points on an interest rate time series. Our
empirical results reveal that the proposed method is efficient for change-point detection
even when the data are serially correlated.

10.2. Parametric Formulation for Change-Point


Problems
Suppose that there exists a single change-point between two (not necessarily adjacent)
segments in a sequence of independent random variables and let X and Y be any two
random variables from the respective segments, X from the first and Y from the second.
The null hypothesis states that there is no change-point from one segment to the next.
Consider a kernel function t = h (x, y) and let the density of T = h (X, Y ) be given by

π (t; θ) = exp [θt − K(θ)] f0 (t) (10.1)

© Springer Nature Switzerland AG 2018 209


M. Alvo, P. L. H. Yu, A Parametric Approach to Nonparametric Statistics,
Springer Series in the Data Sciences, https://doi.org/10.1007/978-3-319-94153-0_10
10. Multiple Change-Point Problems

where f0 (t) is the density of T under the null hypothesis and K (θ) is a normalizing
constant. This is an example of exponential tilting where the first factor in (10.1)
represents the alternative to the null hypothesis. Consider the specific case where the
kernel is given by
h (x, y) = sgn (x − y) . (10.2)
When θ = 0, there is no change-point and sgn (x − y) = ±1 with equal probability 1
2
irrespective of the underlying common distribution of X and Y . Hence

1
f0 (t) = , t = ±1.
2
and the normalizing constant K(θ) is calculated to be

K(θ) = ln(cosh(θ)).

We may express the null hypothesis in terms of the parameter θ

H0 : θ = 0

and in that case K (0) = 0. The kernel in (10.2) appears in the Mann–Whitney statistic
when testing for a change in mean between two distributions. The use of different kernel
functions allows us flexibility in measuring the change between two segments.

10.2.1. Estimating the Location of a Change-Point using a


Composite Likelihood Approach
Consider a sequence of N independent observations Z1 , . . . , ZN where there is a change-
point that breaks the sequence into two segments: observations X τ = {Z1 , Z2 , . . . , Zτ },
and Y τ = {Zτ +1 , Zτ +2 , . . . , ZN }. Instead of setting up the multivariate distribution of
all N observations, we here adopt the composite likelihood approach based on the kernel
defined in (10.2). Let {tij } be the collection of kernel values

tij = sgn (zj − zi ) , i = 1, . . . , τ, j = τ + 1, . . . , N.

We define the composite likelihood as

"
τ "
N
L(θ; X τ , Y τ ) = fT (tij ; θ)
i=1 j=τ +1

210
10. Multiple Change-Point Problems

using the density in (10.1). Hence the composite log-likelihood, apart from a constant,
is given by

τ 
N
(θ; X τ , Y τ ) = θ sgn(zi − zj ) − τ (N − τ )K(θ). (10.3)
i=1 j=τ +1

Given τ , maximizing (θ; X τ , Y τ ) with respect to θ leads to the estimate of θ:


 

1 
τ 
N
θ̂(τ ) = tanh−1 sgn(zi − zj ) . (10.4)
τ (N − τ ) i=1 j=τ +1

A change-point location τ̂ is then estimated as

τ̂ = argmax (θ̂(τ ); X τ , Y τ ).
τ

In the following lemma, we prove the almost sure convergence of this statistic.

Lemma 10.1. Consider the model (10.1) and let Xi , i = 1, . . . , m and Yj , j = 1, . . . , n


 n
be two sequences of independent random variables. Let Um,n = m i=1 j=1 Sij , where
Sij = sgn (Xi − Yj ). Let μXY = E [sgn (X − Y )]. Then as min{m, n} −→ ∞,

Um,n a.s.
−→ μXY .
mn
Proof. Using the result in (Lehmann (1975), p. 335),
 
V ar(Um,n ) = V ar(Sij ) + Cov(Sij , Skl )
i,j ijk
= mnV ar(Sij ) + mn (m − 1) Cov(Sij , Skj ) + mn (n − 1) Cov(Sij , Si )
≤ mn(m + n − 1)M,

where M is an upper bound for V ar(Sij ). It follows from Chebyshev’s inequality that
for each  > 0 we have as min{m, n} −→ ∞,
 
1 2M
P (|Um,n − E(Um,n )| > mn) ≤ 2 2 2 V ar(Um,n ) = O →0
mn min{m, n}2

P
and hence Um,n − E(Um,n ) −→ 0. We have for subsequences {m2 } , {n2 },
  
 2M
P |Um2 ,n2 − E(Um2 ,n2 )| > m2 n2  ≤ O < ∞.
m n m n
mn min{m, n}2

211
10. Multiple Change-Point Problems

Hence by the Borel-Cantelli Lemma 2.2, we have



P |Um2 ,n2 − E(Um2 ,n2 )| > m2 n2  i.o. = 0

and for the subsequences {m2 } , {n2 },

Um2 ,n2 − E(Um2 ,n2 ) a.s.


−→ 0. (10.5)
m 2 n2
To show there is little difference between the sequences and the subsequences, let

Dm,n = max |Uk1 ,k2 − E(Uk1 ,k2 ) − (Um2 ,n2 − E(Um2 ,n2 ))| ,

where the maximum is taken over m2 ≤ k1 < (m + 1)2 ,n2 ≤ k2 < (n + 1)2 . From
Chebyshev’s inequality it also follows similarly that
 
1 64M
P Dm,n > m n  i.o. ≤ 4 4 2 V ar(Dm,n ) ≤ O
2 2
,
mn (min{m, n})2 2

and hence
Dm,n a.s.
−→ 0. (10.6)
m 2 n2
Note that for m2 ≤ k1 < (m + 1)2 and n2 ≤ k2 < (n + 1)2 ,

|Uk1 ,k2 − E(Uk1 ,k2 )| |Um2 ,n2 − E(Um2 ,n2 )| + Dm,n


≤ ,
k1 k2 m 2 n2

and the proof of almost sure convergence then follows by using (10.5) and (10.6).

10.2.2. Estimation of Multiple Change-Points


The estimation of multiple change-points can be done by applying the above method
recursively via a binary segmentation algorithm. The main idea is to recursively de-
tect a new change-point in one of the segments generated from those change-points
estimated previously. Consider we currently have k − 1 change-points estimated at lo-
cations τ̂1 , τ̂2 , . . . , τ̂k−1 with 0 = τ̂0 < τ̂1 < . . . , < τ̂k−1 < τ̂k = N , resulting in a partition
of the N observations into k segments S1 , . . . , Sk , where Si = {Zτ̂i−1 +1 , . . . , Zτ̂i }. Now
we can apply the composite likelihood method of estimating a single change-point men-
tioned in Section 10.2.1 to the observations in each of the k segments. Suppose that in
the i th segment, the composite likelihood (10.3) is maximized at a proposed change-
point location τ̂ (i), which partitions the segment into two other segments denoted by
X τ̂ (i) and Y τ̂ (i) with sizes mi = τ̂ (i) − τ̂i−1 and ni = τ̂i − τ̂ (i). The location of the k th
estimated change-point is the one with the largest scaled composite likelihood:

212
10. Multiple Change-Point Problems

τ̂ (i∗ ) = argmax Q(X τ̂ (i) , Y τ̂ (i) ), (10.7)


τ̂ (i),i=1,...,k

Q(X τ̂ (i) , Y τ̂ (i) ) = (θ̂(τ̂ (i)); X τ̂ (i) , Y τ̂ (i) )/(mi + ni ). (10.8)

The purpose of the scaling in (10.7) is to give consistent estimates of the change-point
locations (see Section 10.3). In searching for K change-points, this binary segmentation
procedure has a computation cost of order O(KN ln N ) as compared to a standard grid
search procedure with a computation cost of O(K2N ). We can further speed up the
procedure by searching for the change-points for the segments in parallel.

10.2.3. Testing the Significance of a Change-Point


The binary segmentation algorithm proposed in the previous section requires specifying
in advance the number of change-points which is generally unknown in practice. It is
suspected that some estimated change-points are not significant and can be removed. In
this section, a score test is proposed to determine the statistical significance of a change-
point. The test can also serve as a stopping rule in searching for a new change-point in
the binary segmentation algorithm.
Suppose that k − 1 change-points have been identified. We wish to examine the
significance of the k th estimated change-point, as defined by (10.7). This problem
is equivalent to performing a test for H0 : θ = 0 under model (10.1) based on the
observations in the i∗ th segment. We make use of the test statistic:

U (X τ̂ (i∗ ) , Y τ̂ (i∗ ) )2
,
I(0)

where U (X τ̂ (i∗ ) , Y τ̂ (i∗ ) ) and I(0) are the score function and Fisher information respec-
tively evaluated at θ = 0:
 τ̂ (i∗ )
∂(θ(τ̂ (i∗ )); X τ̂ (i∗ ) , Y τ̂ (i∗ ) )    τ̂i∗
U (X τ̂ (i∗ ) , Y τ̂ (i∗ ) ) =  = sgn(zi − zj ),
∂θ θ=0 ∗
i=τ̂i∗ −1 +1 j=τ̂ (i )+1
 2 
∂ (θ(τ̂ (i )); X τ̂ (i∗ ) , Y τ̂ (i∗ ) ) 

I(0) = − E  = mi∗ ni∗ K  (0) = mi∗ ni∗ .
∂θ2 θ=0

We reject H0 for large values of the test statistic. For a fixed change-point location, the
test statistic has an asymptotic chi-square distribution with 1 d.f. under H0 . However,
the maximum of the test statistics among all the possible change-point locations does
not follow a chi-square distribution. We may instead compute the p-value of the test
statistic through a permutation test. Since under H0 , the X  s and Y  s are independent
and identically distributed, we can permute the observations in the segment and calculate
the test statistic for each permutation sample. Usually we can calculate the exact p-value

213
10. Multiple Change-Point Problems

if we consider all possible permutations. However, listing all possible permutations


would be time-consuming for large samples. Instead, we may consider 1000 random
permutations to obtain an approximate p-value.
We can conduct the permutation test at each step of including a new change-point.
If the p-value of the test is larger than a prespecified level of significance, the binary
segmentation algorithm stops. Otherwise, we continue to search for another new change-
point.

10.3. Consistency of the Estimated Change-Point


Locations
10.3.1. The Case of Single Change-Point
In the case of a single point and a given τ , the scaled composite log-likelihood in (10.7)
evaluated at the estimate θ̂(τ ) stated in (10.4) can be written as

  τ N
τ (N − τ ) 1 1
Q(X τ , Y τ ) = ŵ tanh−1 (ŵ) + ln(1 − ŵ2 ) , ŵ = sgn(zi − zj ).
N 2 τ (N − τ ) i=1 j=τ +1

We define the estimate τ̂N of the change-point location as

τ̂N = arg max Q(X τ , Y τ ).


τ

The following lemma shows that τ̂N is a strongly consistent estimator for a single change-
point location. Here we assume that μXY = E (sgn(X − Y )) < 1. Otherwise, it is trivial
that X is always greater than Y .

Lemma 10.2. Suppose there is a single change-point. Let γ be the true proportion of
observations belonging to the segment defined under model (10.1). Let {δN } be a sequence
of positive numbers such that δN → 0, N δN → ∞, as N → ∞. Then for N large enough,
γ ∈ [δN , 1 − δN ] and for all  > 0,
   
 τ̂ 
P lim γ −  <  = 1.
N
N →∞ N

Proof. For any γ̃ ∈ [δN , 1 − δN ], let X(γ̃) = {Z1 , . . . , Z γ̃N } and Y (γ̃) = {Z γ̃N +1 , . . . ,
ZN }. Then as N −→ ∞,
 
a.s. γ 1−γ
ŵ −→ I(γ̃ ≥ γ) + I(γ̃ < γ) μXY = w(γ̃),
γ̃ 1 − γ̃

214
10. Multiple Change-Point Problems

uniformly in γ̃. As Q is a continuous function of γ̃, it can be shown that as N −→ ∞,

1 a.s.
Q(X(γ̃), Y (γ̃)) −→ γ̃(1 − γ̃)h(w(γ̃)),
N

uniformly in γ̃, where h(a) = a tanh−1 (a) + 12 ln(1 − a2 ). Notice |w(γ̃)| < 1 and applying
the Taylor series expansion of h at a = 0, there exists a large K such that with a = w(γ̃),

a 2 a4 a2K
h(a) = + + ··· + ,
2 12 2K(2K − 1)

and hence we have


K  2k 
μ2k γ (1 − γ̃) (1 − γ)2k γ̃
γ̃(1 − γ̃)h(w(γ̃)) = XY
I(γ̃ ≥ γ) + I(γ̃ < γ) .
k=1
2k(2k − 1) γ̃ 2k−1 (1 − γ̃)2k−1
(10.9)

2k−1 is monotonic decreasing in γ̃ for γ̃ ≥ γ while


It is easy to show that for k ≥ 1, γ̃(1−γ̃)
γ̃
(1−γ̃)2k−1
is monotonic increasing in γ̃ for γ̃ < γ. Therefore, γ̃(1 − γ̃)h(w(γ̃)) attains
its maximum when γ̃ = γ. The rest of the proof proceeds analogously to the proof of
Theorem 1 of Matteson and James (2014).

10.3.2. The Case of Multiple Change-Points


Let us first consider that there are two change-points, γ = {γ (1) , γ (2) } such that the se-
quence of N observations is partitioned into 3 segments X γ = {Z1 , . . . , Z γ (1) N }, W γ =
{Z γ (1) N +1 , . . . , Z γ (2) N } and Yγ = {Z γ (2) N +1 , . . . , ZN } and any two random variables
obtained from two distinct segments satisfy model (10.1). Let μXY = E(sgn(X − Y )).
Similarly, we can define μXW and μW Y . Assume that at least one of μXY , μXW and μW Y
has its absolute value less than one.

Lemma 10.3. Consider any change-point γ̃ such that γ (1) ≤ γ̃ ≤ γ (2) . Then the se-
quence of N observations is partitioned into 2 segments X(γ̃) = {Z1 , . . . , Z γ̃N } and
 N
Y (γ̃) = {Z γ̃N +1 , . . . , ZN }. Let ŵ = γ̃N (N1− γ̃N ) γ̃N
i=1 j= γ̃N +1 sgn(zi − zj ). As
N −→ ∞,
a.s.
sup |ŵ − p(γ̃)| −→ 0,
γ̃∈[γ (1) ,γ (2) ]

where

γ (1) (γ (2) − γ̃) γ (1) (1 − γ (2) ) (γ̃ − γ (1) )(1 − γ (2) )


p(γ̃) = μXW + μXY + μW Y . (10.10)
γ̃(1 − γ̃) γ̃(1 − γ̃) γ̃(1 − γ̃)

The proof is similar to Lemma 10.2 and is therefore omitted.

215
10. Multiple Change-Point Problems

As Q is a continuous function of γ̃, it can be shown that as N −→ ∞,


a.s.
Q(X(γ̃), Y (γ̃))/N −→ γ̃(1 − γ̃)h(p(γ̃)) = q(γ̃),

uniformly in γ̃.

Theorem 10.1. For AN ⊂ (δN , 1 − δN ) and real x, define

d (x, AN ) = inf {|x − y| : y ∈ AN } .

Let τ̂N be the estimated change-point and set

AN = {y ∈ [δN , 1 − δN ] : q (y) ≥ q (γ) , ∀γ} .

Then  
τ̂N a.s.
d , AN −→ 0, as N → ∞.
N
aγ̃+b
Proof. Note that p(γ̃) in (10.10) can be rewritten in a form: p(γ̃) = γ̃(1−γ̃) for some a
and b, and |p(γ̃)| < 1. Applying the Taylor series expansion of h at a = 0, there exists
a large K such that
q(γ̃) = γ̃(1 − γ̃)h(p(γ̃))
can be approximated by


K
μ2k (aγ̃ + b)2k
XY
q(γ̃) = .
k=1
2k(2k − 1) (γ̃(1 − γ̃))2k−1

It is easy to show that q(γ̃) is continuously differentiable and strictly convex as the sums
and products of convex functions are also convex.
Hence, for any two points γ̃1 , γ̃2 ∈ [γ (1) , γ (2) ], there exists a c > 0 such that

|q(γ̃1 ) − q(γ̃2 )| > c |γ̃1 − γ̃2 | + o (|γ̃1 − γ̃2 |) .

The rest of the proof proceeds analogously to the proof of Theorem 2 of Matteson and
James (2014).

Finally, the extension of the consistency proof for multiple change-points is straight-
forward by noting that in the case of multiple change-points, p(γ̃) in (10.10) is still
aγ̃+b
represented in the form: γ̃(1−γ̃) .

216
10. Multiple Change-Point Problems

10.4. Simulation Experiments


In this section, we report on the simulation experiments conducted to study the perfor-
mance of the proposed algorithm.

10.4.1. Model Setup


Consider the Blocks data sets (Donoho and Johnstone, 1995) which are generally
considered difficult for multiple change-point detection problems in view of the highly
heterogeneous segment levels and lengths. We generate the Blocks data sets with 11
change-points and N samples:


11
Zi = hj J(N ti − τj ) + σεi J(x) = {1 + sgn(x)} /2
j=1
{τj /N } = {0.1, 0.13, 0.15, 0.23, 0.25, 0.40, 0.44, 0.65, 0.76, 0.78, 0.81}
{hj } = {2.01, −2.51, 1.51, −2.01, 2.51, −2.11, 1.05, 2.16, −1.56, 2.56, −2.11}

where the ti ’s are equally spaced in [0, 1] and {τj /N } marks the segment’s information.
The position of each change-point is τj and the mean difference is controlled by hj .
Various i.i.d. distributions for εi are considered in our simulation experiments:
N (0, 1): standard normal distribution, t(2): Student’s t distribution with two degrees
of freedom, χ2 (3): chi-squared distribution with three degrees of freedom, Cauchy(0, 1):
Cauchy distribution with location 0 and scale 1, P areto(α = 0.5), P areto(α = 1.5):
Pareto distributions with α = 0.5, and 1.5 and LN (0, 1): log-normal distribution with
location 0 and scale 1. In our simulation, N is set to be 500 and 1000 with σ = 0.5. Ex-
amples of simulated model with various error distributions can be found in Figure 10.1.

10.4.1.1. Performance Measures


In order to compare the performances of various change-point methods, we consider two
distance measures between the estimated change-point set Γ̂est and the true change-point
set Γtrue :

ξ1 (Γ̂est , Γtrue ) = max min |a − b| ,


b∈Γtrue a∈Γ̂est

ξ2 (Γtrue , Γ̂est ) = max min |a − b| .


b∈Γ̂est a∈Γtrue

Note that ξ1 (Γ̂est , Γtrue ) measures the under-segmentation error and ξ2 (Γtrue , Γ̂est ) mea-
sures the over-segmentation error. We also calculate the sum of ξ1 and ξ2 as a measure
of the total change-point location error.

217
10. Multiple Change-Point Problems
2
Model with Normal(0,1) error Model with (3) error
5 10

Simulated Data
Simulated Data
4 8
3
6
2
4
1
2
0
-1 0

-2 -2
0 100 200 300 400 500 0 100 200 300 400 500
Position Position

Model with Laplace(0,1) error Model with Cauchy(0,1) error


6 30
Simulated Data

Simulated Data
20
4
10

2 0

-10
0
-20

-2 -30
0 100 200 300 400 500 0 100 200 300 400 500
Position Position

Figure 10.1.: Examples of simulated independence model introduced by Donoho and


Johnstone (1995) with different error terms. The green line showed the
true mean of each segments. The red dotted lines indicate the location of
the change-points found by our method when the number of change-points
is unknown.

10.4.2. Simulation Results


10.4.2.1. Known Number of Change-Points
We first simulate data under the model described in Section 10.4.1 with known number
of change-points, i.e., 11. We compare our method with the traditional parametric
likelihood ratio approach (Hinkley, 1970). Note that there are two ways to minimize
the objective function in the parametric likelihood ratio (PLR) method which can be
estimated using the binary segmentation (BS) algorithm and dynamic programming
with the PELT algorithm (Killick et al., 2012). These two algorithms are implemented
using the “change-point” R package. We also compete with the nonparametric methods
called e.divisive which is implemented using the “ecp” R package with α = 1 and k = 11.
The simulation results are shown in Table 10.1.
As expected, parametric methods perform better under normal assumption. How-
ever, our method and e.divisive outperform the parametric likelihood ratio approach
and provide a robust inference for the change-points when the error term does not fol-
low normal or Laplace distribution. We can also see that the standard deviation increases
substantially for PLR methods when the normality assumption is violated. In the cases
of Laplace(0, 1) Cauchy(0, 1), P areto(α = 0.5), P areto(α = 1.5), our method has some
advantages over the e.divisive method.

218
Table 10.1.: Simulation Results for known number of change-points (11). Standard deviations are given in parentheses.
Number in bold represents the method with the smallest total error.

ξ1 (Γ̂est , Γtrue ) ξ2 (Γtrue , Γ̂est ) ξ1 + ξ2


Error N Our method PLR(BS) PLR(PELT) e.divisive Our method PLR(BS) PLR(PELT) e.divisive Our methodPLR(BS)PLR(PELT)e.divisive
500 2.30(1.87) 1.90(2.05) 1.10(0.79) 2.00(0.86) 2.30(1.87) 1.80(1.64) 1.10(0.79) 2.00(0.86) 4.60 3.70 2.20 4.00
N (0, 1)
1000 4.10(3.99) 2.40(2.84) 1.10(1.41) 2.55(2.91) 4.10(3.99) 2.40(2.84) 1.10(1.41) 2.55(2.91) 8.20 4.80 2.20 5.10
500 17.00(13.73) 19.50(17.99) 20.55(10.62) 9.15(8.41) 12.70(10.57) 13.80(10.46) 18.25(15.36) 16.15(17.21) 29.70 33.3 38.80 25.30
χ2(3)
1000 14.45(15.91) 25.20(22.38) 27.00(20.88) 9.70(7.33) 15.40(15.30) 30.80(53.03) 30.10(50.10) 9.00(7.32) 29.85 56.00 57.10 18.70
500 2.00(2.05) 1.40(1.39) 0.90(1.37) 2.10(0.97) 2.00(2.05) 1.40(1.39) 0.90(1.37) 2.10(0.97) 4.00 2.80 1.80 4.20
Laplace(0, 1)
1000 1.60(1.78) 1.05(0.76) 0.75(0.64) 1.65(0.67) 1.70(1.78) 1.05(0.76) 0.75(0.64) 1.65(0.67) 3.30 2.10 3.30

219
1.50
500 24.40(17.59) 94.80(52.51) 94.30(47.92) 18.80(7.69) 23.45(18.67) 46.10(22.36) 47.75(23.57) 29.30(19.67) 47.85 140.90 142.05 48.10
Cauchy(0, 1)
1000 16.00(18.55) 206.95(115.54) 169.30(91.42) 35.00(19.73) 18.50(35.10) 107.10(51.04) 96.15(44.65) 55.60(44.75) 34.50 314.05 265.45 90.60
500 8.30(7.83) 23.95(11.46) 40.65(23.31) 4.85(4.36) 8.75(9.88) 20.15(15.76) 48.65(26.19) 5.65(6.18) 17.05 44.10 89.30 10.50
t(2)
1000 5.95(5.28) 31.40(23.22) 77.95(42.44) 3.75(2.02) 5.95(5.28) 43.40(59.78) 81.60(44.38) 3.75(2.02) 11.90 74.80 159.55 7.50
500 11.40(11.31) 75.65(50.02) 81.30(25.61) 16.25(10.50) 11.10(9.91) 45.25(27.30) 53.20(24.22) 28.85(26.82) 22.5 120.9 134.5 45.1
P areto(α = 0.5)
10. Multiple Change-Point Problems

1000 7.80(5.35) 169.05(105.86)188.35(117.21)28.15(28.93) 10.95(20.20) 104.70(43.37)106.90(40.50)43.30(40.39) 18.75 273.75 295.25 71.45


500 29.10(18.73) 98.95(44.31) 90.90(43.25) 42.45(16.83) 28.00(23.51) 47.90(21.54) 50.05(19.96) 44.40(17.14) 57.10 146.85 140.95 86.85
P areto(α = 1.5)
1000 24.30(14.69) 244.65(121.40) 210.30(97.97) 71.60(40.23) 23.05(24.01) 97.55(51.27) 102.20(44.52)74.80(36.84) 47.35 342.20 312.50 146.40
500 5.60(5.61) 18.75(12.00) 32.45(17.98) 3.65(3.50) 10.00(17.63) 12.85(13.81) 32.30(19.03) 8.75(19.84) 15.60 31.60 64.75 12.40
LN (0, 1)
1000 4.05(3.17) 20.00(19.46) 53.00(27.08) 3.20(1.70) 4.05(3.17) 7.15(6.30) 65.80(52.44) 3.20(1.70) 8.10 27.15 118.80 6.40
10. Multiple Change-Point Problems

10.4.2.2. Unknown Number of Change-Points


Simulation experiments were also conducted when the number of change-points is un-
known. In determining the number of change-points, we adopt the permutation test with
1% significance level in our proposed method while the PLR(PELT) method applies BIC,
and the e.divisive method is implemented using the “ecp” R package with α = 1 and
1% significance level. We also consider the e.cp3o method which is implemented using
the “ecp” R package with δ = 2, K = 15 and α = 1. Table 10.2 displays the simulation
results.
Based on the total change-point location error ξ1 + ξ2 , we can see from Table 10.2
that the PLR(PELT) method performs poorly when the error distribution is not nor-
mal or Laplace as it poorly estimates the number of change-points. The e.cp3o method
also cannot estimate the number of change-points well. The e.divisive method outper-
forms our model in some error terms such as N (0, 1), Laplace(0, 1), and t(2). How-
ever, the e.divisive method cannot find any change-point when the error distributions
are Cauchy(0, 1), P areto(α = 0.5), and P areto(α = 1.5). Note that the e.divisive
method requires the existence of the first two moments of the underlying distributions
because of the divergence measure used in e.divisive. Cauchy(0, 1), P areto(α = 0.5),
and P areto(α = 1.5) do not have both first two moments being finite, and the e.divisive
method may greatly underestimate the number of change-points. However, our method
provides a robust estimation in these situations. Based on our parametric embedding,
our method is capable of detecting various types of changes even for some distributions
without finite first two moments.

10.5. Applications
10.5.1. Array CGH Data
DNA copy numbers refer to the number of copies of genomic DNA in a human. The
usual number is two for normal cells for all the non-sex chromosomes. Variations are
indicative of disease such as cancer. Hence, there is a need to produce copy number
maps. The array data used consist of the log2 -ratio of normalized intensities of the
red/green channels indexed by marker locations on a chromosome where the red and
green channels measure the intensities of the cancer and normal samples respectively.
There have been a number of different techniques proposed for analyzing copy number
variation (CNV) data. It is known that CNVs account for an abundance of genetic
variation and may influence phenotypic differences.
The change-point detection method above was applied to the array CGH data (Array
Comparative Genomic Hybridization) with experimentally tested DNA copy number
alterations by Snijders et al. (2001). This data can be downloaded from http://www.
nature.com/ng/journal/v29/n3/suppinfo/ng754_S1.html. These array data sets consist

220
10. Multiple Change-Point Problems

Table 10.2.: Simulation results for unknown number of change-points(11). Number in


bold represents the method with the smallest total error. “-” indicates that
no change-point is identified.
ξ1 (Γ̂est , Γtrue ) ξ2 (Γtrue , Γ̂est )
Error N our methodPLR(PELT) e.cp3o e.divisive our method PLR(PELT) e.cp3o e.divisive
500 1.80(1.11) 5.55(8.94) 32.65(20.65) 2.00(0.86) 2.70(2.72) 1.10(0.72) 2.20(9.37) 2.00(0.86)
N (0, 1)
1000 1.55(1.05) 1.10(1.41) 86.20(67.45) 1.70(0.86) 23.55(50.01) 1.10(1.41) 6.60(16.37) 10.50(36.01)
500 27.40(19.67) 9.45(8.61) 71.60(32.83) 29.60(19.29) 7.60(5.54) 39.15(27.86) 30.50(17.03) 5.40(5.04)
χ2(3)
100016.75(20.84) 10.00(6.15) 199.75(101.38)16.20(15.17)16.55(16.66) 91.75(47.19) 55.40(23.64) 8.95(8.27)
500 2.95(6.00) 2.30(3.71) 33.20(21.75) 2.00(1.17) 4.00(9.53) 1.00(1.52) 7.15(13.35) 2.95(3.56)
Laplace(0, 1)
1000 1.05(0.69) 0.75(0.64) 72.00(48.18) 1.65(0.67) 14.20(40.59) 0.75(0.64) 12.55(21.85) 3.10(6.37)
500 42.75(26.73)43.80(17.99) 121.90(59.00) - 9.30(6.94) 72.55(16.68) 31.35(21.10) -
Cauchy(0, 1)
100023.80(25.80)83.60(23.47) 202.55(74.73) - 12.75(9.51) 138.60(39.64)62.70(36.70) -
500 12.70(10.56) 10.40(6.49) 102.50(51.54) 7.70(6.45) 6.55(4.74) 51.65(26.07) 30.70(19.08) 5.60(4.79)
t(2)
1000 8.15(16.32) 12.35(13.98)206.55(115.11) 3.60(1.88) 11.10(22.46)102.20(40.93)65.50(33.41) 3.75(2.02)
500 13.20(12.83)36.80(12.95) 124.70(52.27) - 9.35(6.84) 66.80(19.67) 27.60(16.27) -
P areto(α = 0.5)
1000 6.55(4.51) 95.70(40.05)259.90(115.63) - 13.05(20.95)124.25(35.78)70.20(34.80) -
500 60.95(25.47)46.60(16.00) 138.80(60.35) - 10.60(6.89) 60.80(19.40) 31.95(17.69) -
P areto(α = 1.5)
100039.05(22.59)98.45(51.44)322.75(195.95) - 19.25(21.19)119.60(42.57)73.90(35.24) -
500 6.65(5.78) 8.20(7.85) 91.25(51.07) 7.25(7.22) 4.60(7.19) 42.65(24.10) 31.60(16.09) 8.50(19.32)
LN (0, 1)
1000 3.55(1.88) 5.00(5.42) 207.35(96.31) 3.20(1.70) 10.75(21.88)107.00(43.62)66.25(30.46) 6.05(12.58)
Error of the Number of change-point ξ1 + ξ2
Error N our methodPLR(PELT) e.cp3o e.divisive our method PLR(PELT) e.cp3o e.divisive
500 0.20(0.52) 0.30(0.66) 6.20(0.41) 0.00(0.00) 4.5 6.65 34.85 4
N (0, 1)
1000 0.70(0.73) 0.00(0.00) 6.55(1.00) 0.15(0.37) 25.1 2.2 92.8 12.2
500 2.30(0.98) 5.40(3.66) 6.95(1.05) 3.05(1.73) 35 48.6 102.1 35
χ2(3)
1000 0.00(0.86) 9.35(2.70) 8.00(1.21) 0.35(0.99) 33.3 101.75 255.15 25.15
500 0.00(0.56) 0.15(0.37) 6.15(0.67) 0.20(0.52) 6.95 3.3 40.35 4.95
Laplace(0, 1)
1000 0.40(0.60) 0.00(0.00) 6.30(0.86) 0.05(0.22) 15.25 1.5 84.55 4.75
500 3.40(1.93) 13.00(0.00) 7.85(1.50) 9.90(0.97) 52.05 116.35 153.25 -
Cauchy(0, 1)
1000 0.45(1.00) 13.00(0.00) 7.80(1.36) 9.65(1.04) 36.55 222.2 265.25 -
500 0.80(1.06) 7.05(3.09) 7.75(1.02) 0.40(0.94) 19.25 62.05 133.2 13.3
t(2)
1000 0.10(0.64) 12.55(1.00) 7.60(1.35) 0.05(0.22) 19.25 114.55 272.05 7.35
500 0.40(0.99) 13.00(0.00) 8.00(1.38) 8.65(1.81) 22.55 103.6 152.3 -
P areto(α = 0.5)
1000 0.25(0.55) 13.00(0.00) 8.25(1.07) 8.75(1.83) 19.6 219.95 330.1 -
500 4.50(1.32) 13.00(0.00) 7.75(1.21) 10.70(0.66) 71.55 107.4 170.75 -
P areto(α = 1.5)
1000 1.10(1.02) 13.00(0.00) 8.00(1.34) 10.65(0.59) 58.3 218.05 396.65 -
500 0.40(0.68) 6.65(3.25) 7.35(1.23) 0.35(0.93) 11.25 50.85 122.85 15.75
LN (0, 1)
1000 0.30(0.66) 11.80(1.47) 7.75(1.37) 0.10(0.45) 14.3 112 273.6 9.25

221
10. Multiple Change-Point Problems
Log2ration array for Cell lines T47D and Changepoint found by our method
1.5

0.5

log 2 (T/R)
-0.5

-1

-1.5

-2

-2.5
0 500 1000 1500 2000
Position
Log2ration array for Cell lines T47D and Changepoint found by e.divisive
1.5

0.5

0
log 2 (T/R)

-0.5

-1

-1.5

-2

-2.5
0 500 1000 1500 2000
Position
Log2ration array for Cell lines T47D and Changepoint found by PLR(PELT)
1.5

0.5

0
log 2 (T/R)

-0.5

-1

-1.5

-2

-2.5
0 500 1000 1500 2000
Position

Figure 10.2.: Log2ration array for Cell lines T47D and change-points found by our
method, e.divisive and PLR(PRLT). Vertical dotted lines are the change-
points found by these method.

of single experiments on 15 human cell strains and 9 other cell lines. Each cell line is
divided by chromosomes. Each array contains measurements for approximately 2200
BACs (bacterial artificial chromosomes) spotted in triplicates. The variable used for
analysis is the normalized average of the log 2 test over reference ratio.
We first apply our method, as well as the e.divisive and PRL(PELT) methods to one
of the cell lines labeled T47D with sample size N = 2295. To determine the number of
change-points, our method uses the 5% significance level. The parametric likelihood ap-
proach (PLR) with PELT algorithm uses BIC penalty in the R package “change-point.”
The e.divisive method uses α = 1, minimum segment length=2, and 5% significance
level. Figure 10.2 flags the change-points found by these three methods. We can see
that our method successfully locates most of the change-points especially those short
but significantly distinct segments. Note that our algorithm won’t flag any single ex-
treme data point as a segment although our minimum length of a segment is set to be
1. Comparing with the e.divisive method, our method found 51 change-points while

222
10. Multiple Change-Point Problems

e.divisive found 43 change-points. Looking at the additional change-points found by our


method, some of them such as three change-points after position 2200 are hard to find.
In terms of the computational time spent in searching for the change-points, e.divisive
took about 15 minutes while our method took about 2 minutes only in a PC with the
same 3.6GHz. Surprisingly, the PLR(RELT) only finds 3 change-points and misses all
of the important change-points with BIC penalty.
Continuing, we applied our method as well as the e.divisive method to 15 chromo-
somes in the data and all of which except GM07081 chromosome 15 are identified as
having DNA copy number alterations by a method called spectral karyotyping. Fig-
ure 10.3 illustrates all change-points identified by our method and e.divisive based on
5% significance level. The results reveal that our method successfully finds those identi-
fied change-points in all chromosomes. Such was not the case for the e.divisive method.
See, for example, GM03563 Chromosome 3, GM01750 chromosome 9, GM01535 chro-
mosome 5, and GM0781 chromosome 7. Sometimes change-points can appear in the
narrow regions at the beginnings or ends of the data such as GM01535 Chromosome 12.
However, e.divisive, HMM, and CBS could not detect these kind of change-points at the
beginnings or ends of the data (Chromosome 9 on GM03563 and Chromosome 12 on
GM01535) while our method could. The LB method could not find the change-points in
Chromosome 15 on GM07081 while our method found a relatively reasonable segment in
the sequence. It should be noted that LB, HMM, and CBS are parametric methods rely-
ing on certain distributional assumptions. Our method is entirely nonparametric though
embedded in a parametric setting. In addition, our method can find chromosomes with
DNA copy number alterations which have not been confirmed by spectral karyotyping.
They may either be false positives or represent real DNA copy number alterations which
are undetectable due of the low resolution of spectral karyotyping.

10.5.2. Interest Rate Data


The monthly real interest rate data over the period from January 1980 to June 2015
(N = 426) are studied. Following Garcia and Perron (1996); Bai and Perron (2003), we
focus on a more recent three-month treasury bill rate data deflated by the CPI. The data
are downloaded from the US Board of Governors of the Federal Reserve System (http://
www.federalreserve.gov/econresdata/default.htm) and US Bureau of Labor Statistics
website (http://www.bls.gov/cpi/#tables).
We applied our method based on a 1% significance level to the real interest rate
data set over 1980:1–2015:6 (N = 426) shown in Figure 10.4. To compare the methods,
four economic recession periods and the positions of change-point found by our method,
e.divisive and PLR(PLET) are shown in Figure 10.5. A summary of the change-point
positions found by three methods for the four recession periods is also presented in
Table 10.3.

223
10. Multiple Change-Point Problems

GM01524 Chromosome 6 GM01535 Chromosome 5


0.6
0.6 0.4

log 2 (T/R)

log 2 (T/R)
0.4
0.2
0.2
0 0

-0.2 -0.2
0 20 40 60 80 0 20 40 60 80
Position Position

GM01535 Chromosome 12
0.5

log 2 (T/R)
0

-0.5

-1
0 20 40 60 80
Position

GM01750 Chromosome 9 GM01750 Chromosome 14


0.6
0.6 0.4
log 2 (T/R)

log 2 (T/R)
0.4
0.2
0.2
0 0

-0.2 -0.2
0 20 40 60 80 100 0 20 40 60
Position Position

GM03134 Chromosome 8
1

0
log 2 (T/R)

-1

-2

-3
0 50 100 150
Position

GM03563 Chromosome 3 GM03563 Chromosome 9


0.5
0.6
log 2 (T/R)

log 2 (T/R)

0
0.4
0.2
-0.5
0
-0.2 -1
0 20 40 60 80 0 20 40 60 80 100
Position Position

GM05296 Chromosome 10

0.6
log 2 (T/R)

0.4
0.2
0
-0.2
0 20 40 60 80 100 120
Position

GM05296 Chromosome 11 GM07081 Chromosome 7


0.5 1
log 2 (T/R)

log 2 (T/R)

0 0.5

-0.5 0

-1 -0.5
0 50 100 150 0 50 100 150
Position Position

GM07081 Chromosome 15
0.4

0.2
log 2 (T/R)

-0.2

-0.4
0 10 20 30 40 50 60 70
Position

Figure 10.3.: The chromosomes with identified alterations and the change-points found
by our method and e.divisive. Our results are shown by red dotted line.
The results of e.divisive are shown by blue dashed line. The blue-red dash-
dot line means the two methods find same change-point.

224
10. Multiple Change-Point Problems
Real Interest rate and change points found by our method 1980:1-2015:6
10

Real Interest rate


0

-5

-10
1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012 2014
Year

Figure 10.4.: The real interest rate data and change-points found by our method

Table 10.3.: Four recession periods and the positions of change-point found by our
method, e.divisive and PLR(PLET). Numbers in bold refer to the method
which successfully finds the starting point or the end point of the recession
period.
Position of change-point found
Four recessions period
Our method e.divisive PLR(PLET)
1980:1–1980:11 1980:11 1980:11 1980:10
1990:7–1991:3 1990:7 1990:8 1990:7
2001:3–2001:11 2001:3 2001:3 2001:3
2007:12–2009:6 2007:12 2007:10 2007:10

From the first subgraph (1980:1–1984:12) in Figure 10.5, our method and e.divisive
both successfully find one change-point at the end of the recession period in 1980:11.
However, PLR(PELT) method falsely overestimates the number of change-points and
falsely detects many change-points at the beginning of the data (during the period
1980–1981). This may be caused by the great variation and outliers at the beginning of
the data set. From the second subgraph (1989:1–1993:12) in Figure 10.5, our method
and PLR(PELT) successfully locate the change-point at the beginning of the recession
period in 1990:7, while e.divisive method locates the change-point a bit late in 1990–1998.
From the third subgraph (2000:1–2004:12) in Figure 10.5, all three methods successfully
locate the change-point at the beginning of the recession period in 2001:03. From the
fourth subgraph (2007:1–2011:12) in Figure 10.5, only our method successfully locates
the change-point at the beginning of the recession period in 2007:12, while the other two
methods find the change-point two months early due to the sudden fall of the interest
rate data. Finally, we conclude that our method successfully locates the change-points
over all four recession periods.

225
10. Multiple Change-Point Problems

Monthly Real Interest rate and change points 1980:1-1984:12


10

Real Interest rate (%)


5

-5

-10
1980 1982 1984
Year

Monthly Real Interest rate and change points 1989:1-1993:12


6
Real Interest rate (%)

-2
1989 1991 1993
Year

Monthly Real Interest rate and change points 2000:1-2004:12


4
Real Interest rate (%)

-2

-4
2000 2002 2004
Year

Monthly Real Interest rate and change points 2007:1-2011:12


4
Real Interest rate (%)

-2

-4
2007 2009 2011
Year

Figure 10.5.: The real interest rate data around four recession periods. Vertical lines
are the change-points found by our method, e.divisive and PLR(PRLT).
Red star points are the periods when U.S. business Recessions actually
happened defined by US National Bureau of Economic Research. Our
results are shown by red dotted line. The results of e.divisive are shown
by blue dashed line. The results of PLR(PRLT) are shown by purple thin
line. When different type lines coincide, it means different methods find
same change-point. Details of the position found can be seen in Table 10.3.

226
10. Multiple Change-Point Problems

Chapter Notes

Although our method assumes independence, we also conducted a series of simulation


experiments where this assumption is violated (see Alvo et al. (2017)). We slightly
changed the structure of the error terms (εi ’s) in the simulation setting to test the per-
formance in this situation. The simulation results revealed that our method outperforms
the other methods when the normality assumption is violated, particularly in the cases
of heavy-tailed distributions. We also checked the independence assumption in the two
real data-sets studied in Section 10.5. We first computed the residuals after removing
the mean of each segment estimated by our method and then applied the Ljung–Box Q
test to the residuals with maximum lag of 20. The p-value of the test is 0.0016 for the
T47D cell line and 0.0001 for the real interest rate data, indicating that the indepen-
dence assumptions for both data sets may not be valid. We conclude that our method
is less sensitive to departures from serial independence.

227
11. Bayesian Models for Ranking
Data
Ranking data are often encountered in practice when judges (or individuals) are asked
to rank a set of t items, which may be political goals, candidates in an election, types of
food, etc. We see examples in voting and elections, market research, and food preference
just to name a few. By studying ranking data, we can understand the judges’ perception
and preferences on the ranked alternatives.
Let R = (R(1), . . . , R(t)) be ranking t items, labeled 1, . . . , t. It will be more
convenient to standardize the rankings as:
  1 2
t+1 t(t − 1)
y = R− 1 / ,
2 12

where y is the t × 1 vector with y ≡ y  y = 1. We consider the following ranking
model:
π(y|κ, θ) = C(κ, θ) exp {κθ  y} ,
where the parameter θ is a t × 1 vector with θ = 1, parameter κ ≥ 0, and C(κ, θ)
is the normalizing constant. In the case of the distance-based models (Alvo and Yu,
2014), the parameter θ can be viewed as if a modal ranking vector. In fact, if R and μ0
represent an observed ranking and the modal ranking of t items respectively, then the
probability of observing R under the Spearman distance-based model is proportional to
  t    
1 2 t (t + 1) (2t + 1) 
exp −λ (R (i) − μ0 (i)) = exp −λ − μ0 R
2 i=1 12
∝ exp {κθ  y} ,
2
where κ = λ t(t12−1) , and y and θ are the standardized rankings of R and μ0 , respectively.
However, the μ0 in the distance-based model is a discrete permutation vector of integers
{1, 2, . . . , t} but the θ in our model is a real-valued vector, representing a consensus
view of the relative preference of the items from the individuals. Since both θ = 1
and y = 1, the term θ  y can be seen as cos φ where φ is the angle between the

© Springer Nature Switzerland AG 2018 229


M. Alvo, P. L. H. Yu, A Parametric Approach to Nonparametric Statistics,
Springer Series in the Data Sciences, https://doi.org/10.1007/978-3-319-94153-0_11
11. Bayesian Models for Ranking Data

Angle between θ and Y on a sphere (t=3)

0.5

0
N

-0.5

-1
-1

-0.5

0 -1
-0.5
0.5 0
0.5
y 1 1 x

Figure 11.1.: Illustration for the angle between the consensus score vector θ = (0, 1, 0)
and the standardized observation of (1, 2, 3) on the sphere when t = 3.

consensus score vector θ and the observation y. Figure 11.1 illustrates an example
of the angle between the consensus score vector θ = (0, 1, 0) and the standardized
observation of R = (1, 2, 3) on the sphere for t = 3. The probability of observing a
ranking is proportional to the cosine of the angle from the consensus score vector. The
parameter κ can be viewed as a concentration parameter. For small κ, the distribution
of rankings will appear close to a uniform whereas for larger values of κ, the distribution
of rankings will be more concentrated around the consensus score vector. We call this
new model as angle-based ranking model.
To compute the normalizing constant C(κ, θ), let Ρt be the set of all possible per-
mutations of the integers 1, . . . , t. Then
 + ,
(C(κ, θ))−1 = exp κθ T y . (11.1)
y∈Ρt

Notice that the summation is over the t! elements in Ρt . When t is large, says greater
than 15, the exact calculation of the normalizing constant is prohibitive. Using the fact
that the set of t! permutations lie on a sphere in (t − 1)-space, our model resembles
the continuous von Mises-Fisher distribution, abbreviated as vM F (x|m, κ), which is
defined on a (p − 1) unit sphere with mean direction m and concentration parameter κ:

p(x|κ, m) = Vp (κ) exp(κm x),

230
11. Bayesian Models for Ranking Data

where p
κ 2 −1
Vp (κ) = p ,
(2π) 2 I p2 −1 (κ)
and Ia (κ) is the modified Bessel function of the first kind with order a. Consequently,
we may approximate the sum in (11.1) by an integral over the sphere:
t−3
κ 2
C(κ, θ)  Ct (κ) = t−3 ,
2 2 t!I t−3 (κ)Γ( t−1
2
)
2

where Γ(.) is the gamma function. Table 11.1 shows the error rate of the approximate log-
normalizing constant as compared to the exact one computed by direct summation. Here,
κ is chosen to be 0.01 to 2 and t ranges from 4 to 11. Note that the exact calculation of the
normalizing constant for t = 11 requires the summation of 11! ≈ 3.9 × 107 permutation.
The computer ran out of memory (16GB) beyond t = 11. This approximation seems to
be very accurate even when t = 3. The error drops rapidly as t increases. Note that this
approximation allows us to approximate the first and second derivatives of log C which
can facilitate our computation in what follows.
Notice that κ may grow with t as θ  y is a sum of t terms. It can be seen from the
applications in Section 11.4 that in one of the clusters for the APA data (t = 5), κ is
7.44(≈ 1.5t) (see Table 11.4). We thus compute the error rate for κ = t and κ = 2t as
shown in Figure 11.2. It is found that the approximation is still accurate with error rate
of less than 0.5% for κ = t and is acceptable for large t when κ = 2t as the error rate
decreases in t.
The error rate for Approximate log Normalizing constant
1.6
κ=2t
κ=t
1.4

1.2
Error rate (%)

0.8

0.6

0.4

0.2
3 4 5 6 7 8 9 10 11
t

Figure 11.2.: The error rate of the approximate log-normalizing constant as compared
to the exact one computed by direct summation for κ = t and κ = 2t.

231
Table 11.1.: The error rate of the approximate log-normalizing constant as compared to the exact one computed by direct
summation.
t
κ 3 4 5 6 7 8 9 10 11
0.01 <0.00001% <0.00001% <0.00001% <0.00001% <0.00001% <0.00001% <0.00001% <0.00001% <0.00001%

232
0.1 <0.00001% <0.00001% <0.00001% <0.00001% <0.00001% <0.00001% <0.00001% <0.00001% <0.00001%
0.5 0.00003% 0.00042% 0.00024% 0.00013% 0.00007% 0.00004% 0.00003% 0.00002% 0.00001%
0.8 0.00051% 0.00261% 0.00150% 0.00081% 0.00046% 0.00027% 0.00017% 0.00011% 0.00008%
1 0.00175% 0.00607% 0.00354% 0.00194% 0.00110% 0.00066% 0.00041% 0.00027% 0.00018%
2 0.05361% 0.06803% 0.04307% 0.02528% 0.01508% 0.00932% 0.00598% 0.00398% 0.00273%
11. Bayesian Models for Ranking Data
11. Bayesian Models for Ranking Data

11.1. Maximum Likelihood Estimation (MLE) of Our


Model
Let Y = {y 1 , . . . , y N } be a random sample of N standardized rankings drawn from
p(y|κ, θ). The log-likelihood of (κ, θ) is then given by


N
l(κ, θ) = N log Ct (κ) + κθ  y i (11.2)
i=1

Maximizing (11.2) subject to θ =  1 and κ ≥ 0, we find that the maximum likelihood
N
yi
estimator of θ is given by θ̂ M LE = i=1 , and κ̂ is the solution of
 Ni=1 yi 
; ;
; N ;

−Ct (κ) I t−1 (κ) ; y
i=1 i ;
At (κ) ≡ = 2 = ≡ r. (11.3)
Ct (κ) I t−3 (κ) N
2

A simple approximation to the solution of (11.3) following Banerjee et al. (2005) is


given by
r(t − 1 − r2 )
κ̂M LE = .
1 − r2
A more precise approximation can be obtained from a few iterations of Newton’s method.
Using the method suggested by Sra (2012), starting from an initial value κ0 , we can
recursively update κ by iteration:

At (κi ) − r
κi+1 = κi − , i = 0, 1, 2, . . . .
1 − At (κi )2 − t−2
κi
At (κ i )

11.2. Bayesian Method with Conjugate Prior and


Posterior
Taking a Bayesian approach, we consider the following conjugate prior for (κ, θ) as

p(κ, θ) ∝ [Ct (κ)]ν0 exp {β0 κm0 θ} , (11.4)

where m0  = 1, ν0 , β0 ≥ 0. Given Y , the posterior density of (κ, θ) can be expressed


by
 [Ct (κ)]N +ν0
p(κ, θ|Y ) ∝ exp {βκm θ} Vt (βκ) · ,
Vt (βκ)

233
11. Bayesian Models for Ranking Data
  ; N ;
−1 ; ;
where m = β0 m0 + N y
i=1 i β , β = ; β0 m 0 + y
i=1 i ;. The posterior density can
be factored as
p(κ, θ|Y ) = p(θ|κ, Y )p(κ|Y ) (11.5)
where p(θ|κ, Y ) ∼ vM F (θ|m, βκ) and
t−3
[Ct (κ)]N +ν0 κ 2 (υ0 +N ) I t−2 (βκ)
p(κ|Y ) ∝ = ν0 +N
2
.
Vt (βκ) t−2
I t−3 (κ) (βκ) 2
2

The normalizing constant for p(κ|Y ) is not available in closed form. Nunez-Antonio and
Gutiérrez-Pena (2005) suggested using a sampling-importance-resampling (SIR) proce-
dure with a proposal density chosen to be the gamma density with mean κ̂M LE and
variance equal to some prespecified number such as 50 or 100. However, in a simulation
study, it was found that the choice of this variance is crucially related to the performance
of SIR. An improper choice of variance may lead to slow or unsuccessful convergence.
Also the MCMC method leads to intensive computational complexity. Furthermore,
when the sample size N is large, βκ can be very large which complicates the compu-
tation of the term I t−2 (βκ) in Vt (βκ). Thus the calculation of the weights in the SIR
2
method will fail when N is large. We conclude that in view of the difficulties for directly
sampling from p(κ|Y ), it may be preferable to approximate the posterior distribution
with an alternative method known as variational inference (abbreviated VI from here
on).

11.3. Variational Inference


Variational inference provides a deterministic approximation to an intractable posterior
distribution through optimization. We first adopt a joint vMF-Gamma distribution as
the prior for (κ, θ):

p(κ, θ) = p(θ|κ)p(κ)
= vM F (θ|m0 , β0 κ)Gamma(κ|a0 , b0 ),

where Gamma(κ|a0 , b0 ) is the Gamma density function with shape parameter a0 and
rate parameter b0 (i.e., mean equal to ab00 ), and p(θ|κ) = vM F (θ|m0 , β0 κ). The choice
of Gamma(κ|a0 , b0 ) for p(κ) is motivated by the fact that for large values of κ, p(κ)
in (11.4) tends to take the shape of a Gamma density. In fact, for large values of κ,
κ
I t−3 (κ)  √e2πκ , and hence p(κ) becomes the Gamma density with shape (ν0 − 1) t−22
+1
2

234
11. Bayesian Models for Ranking Data

and rate ν0 − β0 :

[Ct (κ)]ν0 t−2


p(κ) ∝ ∝ κ(ν0 −1) 2 exp(−(ν0 − β0 )κ).
Vt (β0 κ)

11.3.1. Optimization of the Variational Distribution


In the variational inference framework, we aim to determine q so as to minimize the
Kullback-Leibler (KL) divergence between p(θ, κ|Y ) and q(θ, κ):
 
q (θ, κ)
KL (q|p) = Eq ln .
p (θ, κ|Y )

This can be shown to be equivalent to maximizing the evidence lower bound (ELBO)
(Blei et al., 2017). So the optimization of the variational factors q(θ|κ) and q(κ) is
performed by maximizing the evidence lower bound L(q) with respect to q on the log-
marginal likelihood, which in our model is given by
 
p(Y |κ, θ)p(θ|κ)p(κ)
L(q) = Eq(θ,κ) ln (11.6)
q(θ|κ)q(κ)
= Eq(θ,κ) [f (θ, κ)] − Eq(θ,κ) [ln q(θ|κ)] − Eq(κ) [ln q(κ)] + constant,

where all the expectations are taken with respect to q(θ, κ) and


N  
 t−3
f (θ, κ) = κθ y i + N ln κ − N ln I t−3 (κ) + κβ0 m0 θ
2 2

i=1 
t−2
+ ln κ − ln I t−2 (κβ0 ) + (a0 − 1) ln κ − b0 κ.
2 2

For fixed κ, the optimal posterior distribution ln q ∗ (θ|κ) is ln q ∗ (θ|κ) = κβ0 m0 θ +
N  ∗
i=1 κθ y i + constant. We recognize q (θ|κ) as a von Mises-Fisher distribution
vM F (θ|m, κβ) where
; ;  
; 
N ; N
; ;
β = ;β0 m0 + y i ; and m = β0 m0 + y i β −1 .
; ;
i=1 i=1

Let g(κ) denote the remaining terms in f (θ, κ) − ln q(θ|κ) which only involve κ:
   
t−3
g(κ) = N + a0 − 1 ln κ − b0 κ − N ln I t−3 (κ) − ln I t−2 (κβ0 ) + ln I t−2 (κβ).
2 2 2 2

It is still difficult to maximize Eq(κ) [g(κ)] − Eq(κ) [ln q(κ)] since it involves the evaluation
of the expected modified Bessel function. Following the similar idea in Taghia et al.

235
11. Bayesian Models for Ranking Data

(2014), we first find a tight lower bound g(κ) for g(κ) so that
 
L(q) ≥ L(q) = Eq(κ) g(κ) − Eq(κ) [ln q(κ)] + constant.

From the properties of the modified Bessel function of the first kind, it is known that
the function ln Iν (x) is strictly concave with respect to x and strictly convex relative to
ln x for all ν > 0. Then, we can have the following two inequalities:
 

ln Iν (x) ≤ ln Iν (x̄) + ln Iν (x̄) (x − x̄), (11.7)
∂x
 

ln Iν (x) ≥ ln Iν (x̄) + ln Iν (x̄) x̄(ln x − ln x̄), (11.8)
∂x

where ∂x ln Iν (x̄) is the first derivative of ln Iν (x) evaluated at x = x̄. Applying inequal-
ity (11.7) for ln I t−3 (κ) and inequality (11.8) for ln I t−2 (κβ0 ), we have
2 2

   
t−3
g(κ) ≥ g(κ) = N + a0 − 1 ln κ − b0 κ + ln I t−2 (βκ̄)
2 2


+ ln I t−2 (βκ̄)βκ̄ (ln βκ − ln βκ̄) − N ln I t−3 (κ̄)
∂βκ 2 2

∂ ∂
−N ln I t−3 (κ̄) (κ − κ̄) − ln I t−2 (β0 κ̄) − ln I t−2 (β0 κ̄)β0 (κ − κ̄) .
∂κ 2 2 ∂β0 κ 2

Since the equality holds when κ = κ̄, we see that the lower bound of L(q) is tight.
Rearranging the terms, we have the approximate optimal solution as ln q ∗ (κ) = (a −
1) ln κ − bκ + constant, where
   
t−3 ∂
a = a0 + N + βκ̄ ln I t−2 (βκ̄) , (11.9)
2 ∂βκ 2

 
∂ ∂
b = b0 + N I t−3 (κ̄) + β0 ln I t−2 (β0 κ̄) . (11.10)
∂κ 2 ∂β0 κ 2

We recognize q ∗ (κ) to be a Gamma(κ|a, b) with shape a and rate b. Finally, the posterior
mode κ̄ can be obtained from the previous iteration as:

a−1
b
if a > 1,
κ̄ = a
(11.11)
b
otherwise.

236
11. Bayesian Models for Ranking Data

11.3.2. Comparison of the True Posterior Distribution and Its


Approximation Obtained by Variational Inference
Since we use a factorized approximation for the posterior distribution in the variational
inference approach, it is of interest to compare the true posterior distribution with its
approximation obtained using the variational inference approach. We simulated two
data sets with κ = 1, θ = (−0.71, 0, 0.71) , t = 3 and different data sizes of N = 20
and N = 100. We generated samples from the posterior distribution by SIR method in
Section 11.2 using a gamma density with mean κ̂M LE and variance equal to 0.2 as the
proposal density. We then applied the above variational inference to generate samples
from the posterior distribution. Figure 11.3 exhibits the histograms and box-plots for
the posterior distributions of κ and θ for different settings.
From Figure 11.3, we see that the posterior distribution using the Bayesian-VI is
very close to the posterior distribution obtained by the Bayesian-SIR method. When
the sample size is small (N = 20), there are more outliers for the Bayesian-SIR method
while the posterior κ for the Bayesian-VI method seems to be more concentrated. When
the sample size is large, the posterior estimates of θ and κ become more accurate and
Bayesian-VI is closer to the posterior distribution obtained by the Bayesian-SIR method.

Posterior distribution, N=20 Posterior distribution, N=100


0.25 0.35
Bayesian−SIR Bayesian−SIR
Bayesian−VI 0.3 Bayesian−VI
0.2
0.25
0.15
density

density

0.2

0.1 0.15

0.1
0.05
0.05

0 0
0 0.5 1 1.5 2 2.5 3 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6
Posterior κ Posterior κ
Bayesian−SIR, N=20 Bayesian−VI, N=20 Bayesian−SIR, N=100 Bayesian−VI, N=100
1 1

0.5 0.5 0.5


0.5

0 0 0 0

−0.5 −0.5 −0.5 −0.5

−1 −1
theta1 theta2 theta3 theta1 theta2 theta3 theta1 theta2 theta3 theta1 theta2 theta3
Posterior θ Posterior θ Posterior θ Posterior θ

Figure 11.3.: Comparison of the posterior distribution obtained by Bayesian SIR method
and the approximate posterior distribution by variational inference ap-
proach. The comparison is illustrated for different data sizes of N = 20
(left) and N = 100 (right).

237
11. Bayesian Models for Ranking Data

11.3.3. Angle-Based Model for Incomplete Rankings


A judge may rank a set of items in accordance with some criteria. However, in real life,
some of the ranking data may be missing either at random or by design. For example,
in the former case, some of the items may not be ranked due to the limited knowledge
of the judges. In this kind of incomplete ranking data, a missing item could have
any rank and this is called subset rankings. In another instance called top-k rankings,
the judges may only rank the top 10 best movies among several recommended. The
unranked movies would in principle receive ranks larger than 10. In those cases, the
notation RI = (2, −, 3, 4, 1) refers to a subset ranking with item 2 unranked while
RI = (2, ∗, ∗, ∗, 1) represents a top two ranking with item 5 ranked first and item 1
ranked second.
In the usual Bayesian framework, missing data problems can be resolved by appealing
+ ,
to Gibbs sampling and data augmentation methods. Let RI1 , . . . , RIN be a set of
N observed incomplete rankings, and let {R∗1 , . . . , R∗N } be their unobserved complete
rankings. We want to have the following posterior distribution:

p(θ, κ|RI1 , . . . , RIN ) ∝ p(θ, κ)p(RI1 , . . . , RIN |θ, κ),

which can be achieved by Gibbs sampling based on the following two full conditional
distributions:

"
N
p(R∗1 , . . . , R∗N |RI1 , . . . , RIN , θ, κ) = p(R∗i |RIi , θ, κ),
i=1

"
N
p(θ, κ|R∗1 , . . . , R∗N ) ∝ p(θ, κ) p(R∗i |θ, κ).
i=1

Sampling from p(R∗1 , . . . , R∗N |RI1 , . . . , RIN , θ, κ) can be generated by using the
Bayesian SIR method or the Bayesian VI method which have been discussed in the
previous sections. More concretely, we need to fill in the missing ranks for each ob-
servation and for that we appeal to the concept of compatibility described in Alvo
and Yu (2014) which considers for an incomplete ranking, the class of complete
order preserving rankings. For example, suppose we observe one incomplete sub-
set ranking RI = (2, −, 3, 4, 1) . The set of corresponding compatible rankings is
+ ,
(2, 5, 3, 4, 1) , (2, 4, 3, 5, 1) , (2, 3, 4, 5, 1) , (3, 2, 4, 5, 1) , (3, 1, 4, 5, 2) .
Generally speaking, let Ω(RIi ) be the set of complete rankings compatible with
RIi . For an incomplete subset ranking with k out of t items being ranked, we will
have a total t!/k! complete rankings in its compatible set. Note that p(R∗i |RIi , θ, κ) ∝
p(R∗i |θ, κ), R∗i ∈ Ω(RIi ). Obviously, direct sampling from this distribution will be tedious
for large t. Instead, we use the Metropolis-Hastings algorithm to draw samples from this

238
11. Bayesian Models for Ranking Data

distribution with the proposed candidates generated uniformly from Ω(RIi ). The idea of
introducing compatible rankings allows us to treat different kinds of incomplete rankings
easily. It is easy to sample uniformly from the compatible rankings since we just need
to fill-in the missing ranks under different situations. In the case of top-k rankings,
the compatibility set will be defined to ensure that the unranked items receive rankings
larger than k. Note that it is also possible to use Monte Carlo EM approach to handle
incomplete rankings under a maximum likelihood setting where the Gibbs sampling is
used in the E-step (see Yu et al. (2005)).

11.4. Applications
11.4.1. Sushi Data Sets
We investigate the two data sets of Kamishima (2003) for finding the difference in food
preference patterns between eastern and western Japan. Historically, western Japan has
been mainly affected by the culture of the Mikado emperor and nobles, while eastern
Japan has been the home of the Shogun and Samurai warriors. Therefore, the preference
patterns in food are different between these two regions (Kamishima, 2003).
The first data set consists of complete rankings of t = 10 different kinds of sushi
given by 5000 respondents according to their preference. The region of respondents
is also recorded (N = 3285 for eastern Japan, 1715 for western Japan). We apply
the MLE, Bayesian-SIR, and Bayesian-VI on both eastern and western Japan data.
We chose non-informative priors for both Bayesian-SIR and Bayesian-VI. Specifically,
the prior parameter m0 is chosen uniformly whereas β0 , a0 , and b0 are chosen to be
small numbers close to zero. Since the sample size N is quite large compared to t, the
estimated models for all three methods are almost the same. Figure 11.4 compares the
posterior means of θ between eastern Japan (blue bar) and western Japan (red bar)
obtained by Bayesian-VI method. Note that the more negative value of θi means that
the more preferable sushi i is. From Figure 11.4, we see that the main difference for
sushi preference between eastern and western Japan occurs in salmon roe, squid, sea eel,
shrimp, and tuna. People in eastern Japan have a greater preference for salmon roe and
tuna than the western Japanese. On the other hand, the latter have a greater preference
for squid, shrimp, and sea eel. Table 11.2 shows the posterior parameter obtained by
Bayesian-VI. It can be seen that the eastern Japanese are slightly more cohesive than
western Japanese since the posterior mean of κ is larger.
The second data set contains incomplete rankings given by 5000 respondents who
were asked to pick and rank some of the t = 100 different kinds of sushi according
to their preference, and most of them only selected and ranked the top 10 out of 100
sushi. Figure 11.5 compares the box-plots of the posterior means of θ between eastern
Japan (blue box) and western Japan (red box) obtained by Bayesian-VI. The posterior

239
11. Bayesian Models for Ranking Data

Sushi data (Complete rankings t=10), Eastern VS Western Japan (by Bayesian−VI)

Cucumber
Egg
Squid
Tuna roll
Sea urchin
Sea eel
Shrimp
Salmon roe
Tuna Eastern Japan
Fatty tuna Western Japan

−0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8


Posterior mean of θ

Figure 11.4.: Posterior means of θ for the sushi complete ranking data (t = 10) in eastern
Japan (blue bar) and western Japan (red bar) obtained by Bayesian-VI.

Table 11.2.: Posterior parameters for the sushi complete ranking data (t = 10) in eastern
Japan and western Japan obtained by Bayesian-VI.
Posterior Parameter Eastern Japan Western Japan
β 1458.85 741.61
a 18509.84 9462.70
b 3801.57 2087.37
Posterior Mean of κ 4.87 4.53

distribution of θ is based on the Gibbs samplings after dropping the first 200 samples
during the burn-in period. Since there are too many kinds of sushi, this graph doesn’t
allow us to show the name of each sushi. However, we can see that about one-third of
the 100 kinds of sushi have fairly large posterior means of θi and their values are pretty
close to each other. This is mainly because these sushi are less commonly preferred by
Japanese and the respondents hardly chose these sushi in their list. As these sushi are
usually not ranked as top 10, it is natural to see that the posterior distributions of their
θi ’s tend to have a larger variance.
From Figure 11.5, we see that there exists a greater difference between eastern and
western Japan for small θi ’s. Figure 11.6 compares the box-plots of the top 10 smallest
posterior means of θ between eastern Japan (blue box) and western Japan (red box).
The main difference for sushi preference between eastern and western Japan appears to
be in sea eel, salmon roe, tuna, sea urchin, and sea bream. The eastern Japanese prefer
salmon roe, tuna, and sea urchin sushi more than the western Japanese, while the latter
like sea eel and sea bream more than the former. Generally speaking, tuna and sea
urchin are more oily food, while salmon roe and tuna are more seasonal food. So from
the analysis of both data sets, we can conclude that the eastern Japanese usually prefer
more oily and seasonal food than the western Japanese (Kamishima, 2003).

240
11. Bayesian Models for Ranking Data

Sushi data (Incomplete ranking t=100),Posterior ditribution of θ, Eastern VS Western Japan (by Bayesian−VI)

−0.25 −0.2 −0.15 −0.1 −0.05 0 0.05 0.1 0.15


Posterior θ

Figure 11.5.: Box-plots of the posterior means of θ for the sushi incomplete rankings
(t = 100) in eastern Japan (blue box-plots) and western Japan (red box-
plots) obtained by Bayesian-VI.

Sushi data (Incomplete ranking t=100), Top 10 smallest θ, Eastern VS Western Japan (by Bayesian−VI)

Sea bream

Sea urchin

Egg

Fatty tuna

Tuna

Salmon roe

Octopus

Squid

Shrimp

Sea eel
−0.26 −0.24 −0.22 −0.2 −0.18 −0.16 −0.14
Posterior θ

Figure 11.6.: Box-plots of the top 10 smallest posterior means of θ for the sushi incom-
plete rankings (t = 100) in eastern Japan (blue box-plots and blue circles
for outliers) and western Japan (red box-plots and red pluses for outliers)
obtained by Bayesian-VI.

241
11. Bayesian Models for Ranking Data

11.4.2. APA Data


We revisit the well-known APA data set of Diaconis (1988a) which contains 5738 full
rankings of 5 candidates for the presidential election of the American Psychological As-
sociation (APA) in 1980. For this election, members of APA had to rank five candidates
{A,B,C,D,E} in the order of their preference. Candidates A and C are research psychol-
ogists, candidates D and E are clinical psychologists, and candidate B is a community
psychologist. This data set has been studied by Diaconis (1988a) and Kidwell et al.
(2008) who found that the voting population was divided into 3 clusters.
We fit the data using a mixture of G angle-based models, see Xu et al. (2018) for
the details. We chose a non-informative prior for the Bayesian-VI method for a different
number of clusters G = 1 to 5. Specifically, the prior parameter m0g is a randomly
chosen unit vector whereas β0g , d0g , a0g , and b0g are chosen as random numbers close to
zero. The pig are initialized as G1 . Table 11.3 shows the Deviance information criterion
(DIC) for G = 1 to 5. It can be seen that the mixture model with G = 3 clusters attains
the smallest DIC.

Table 11.3.: Deviance information criterion (DIC) for the APA ranking data.
G 1 2 3 4 5
DIC 54827 53497 53281 53367 53375

Table 11.4 indicates the posterior parameters for the three-cluster solution and Fig-
ure 11.7 exhibits the posterior means of θ for the three clusters obtained by Bayesian-VI.
It is very interesting to see that Cluster 1 votes clinical psychologists D and E as their
first and second choices and dislike especially the research psychologist C. Cluster 2

Table 11.4.: Posterior parameters for the APA ranking data (t = 5) for three clusters
obtained by Bayesian-VI.
Posterior Parameter Cluster 1 Cluster 2 Cluster 3
m 0.06 -0.44 0.26
0.02 0.19 0.14
0.78 -0.64 -0.75
-0.54 0.49 0.55
-0.33 0.39 -0.19
β 1067.10 1062.34 414.74
d 3231.09 1317.21 1189.72
a 4756.33 9224.97 1821.73
b 3330.45 1239.41 1197.80
Posterior mean of κ 1.43 7.44 1.52
Posterior mean of τ 56.31% 22.96% 20.73%

242
11. Bayesian Models for Ranking Data

prefers research psychologists A and C but dislikes the others. Cluster 3 prefers research
psychologist C. From Table 11.4, Cluster 1 represents the majority (posterior mean of
τ1 = 56.31%). Cluster 2 is small but more cohesive since the posterior mean of κ2 is
larger. Cluster 3 has a posterior mean of τ3 = 20.73% and κ3 is 1.52. The preferences of
the five candidates made by the voters in the three clusters are heterogeneous and the
mixture model enables us to draw further inference from the data.

APA Data (t=5), Posterior mean of θ by Bayesian−VI for 3 Clusters


Research psychologist A
Community psychologist B
Cluster 1 Research psychologist C
Clinical psychologist D
Clinical psychologist E

Cluster 2

Cluster 3

−0.5 0 0.5
Posterior mean of θ

Figure 11.7.: Comparison of the posterior mean of θ for the APA data (t = 5) for three
clusters obtained by Bayesian-VI.

Chapter Notes
We proposed a new class of general exponential ranking model called angle-based ranking
models. The model assumed a consensus score vector θ where the rankings reflect the
rank-order preference of the items. The probability of observing a ranking is proportional
to the cosine of the angle from the consensus score vector. Unlike distance-based models,
the consensus score vector θ proposed exhibits detailed information on item preferences
while distance-based model only provide equal-spaced modal ranking. We applied the
method to sushi data and concluded that certain types of sushi are seldom eaten by
the Japanese. Our consensus score vector θ defined on a unit sphere can be easily
re-parameterized to incorporate additional arguments or covariates in the model. The
judge-specific covariates could be age, gender, and income, the item-specific covariates
could be prices, weights, and brands, and the judge-item-specific covariates could be
some personal experience on using each phone or brand. Adding those covariates into
the model could greatly improve the power of prediction of our model. We could also
develop Bayesian inference methods to facilitate the computation. Further details are
available in Xu et al. (2018).

243
12. Analysis of Censored Data
Censored data occur when the value of an observation is only partially known. For ex-
ample, it may be known that someone’s exact wealth is unknown but it may be known
that their wealth exceeds one million dollars. In left censoring, the data may fall below
a certain value whereas in right censoring, it may be above a certain value. Type I cen-
soring occurs when the subjects of an experiment are right censored. Type II censoring
occurs when the experiment stops after a certain number of subjects have failed; the re-
maining subjects are then right censored. Truncated data occur when observations never
lie outside a given range. For example, all data outside the unit interval is discarded.
A good example to illustrate the ideas occurs in insurance companies. Left truncation
occurs when policyholders are subject to a deductible whereas right censoring occurs
when policyholders are subject to an upper pay limit.
A major theme of this chapter is to demonstrate that the key to deriving fundamental
results with the embedding approach for censored data lies in an appropriate choice of
the parametric family. We first briefly introduce survival analysis and then provide an
overview of the developments of rank tests for censored data, highlighting the difficulties
caused by ranking incomplete data and describing important landmarks in overcoming
these difficulties. Next, we generalize the parametric embedding approach to give a
new derivation of what these landmark results have finally led to. More importantly,
coupled with the LAN and local minimaxity results, the approach introduced yields
asymptotically optimal tests for local alternatives in the embedded parametric family.
Since the actual alternatives are unknown, the problem of adaptive (data-dependent)
choice of the score function for rank tests has witnessed important developments. We
provide a brief review of this topic and its implications on the choice of the parametric
family in parametric embedding.

12.1. Survival Analysis


We recall some notions from survival analysis. Let T be a random variable which de-
notes the survival time. The survival function denoted S (t) is the probability that an
individual survives up to time t :

S (t) = P (T > t) = 1 − F (t) ,

© Springer Nature Switzerland AG 2018 245


M. Alvo, P. L. H. Yu, A Parametric Approach to Nonparametric Statistics,
Springer Series in the Data Sciences, https://doi.org/10.1007/978-3-319-94153-0_12
12. Analysis of Censored Data

where F (t) is the distribution function of T assumed continuous. The survival function
is a nonincreasing function of time with the properties

S (0) = 1, S (∞) = 0.

A useful property of the mean is


 ∞
μ = E [T ] = S (t) dt.
0

Next we define the hazard function.

Definition 12.1. The hazard function is defined as the limit of the probability that an
individual fails in a very short time interval given that he has survived up to time t:

P (t < T < t + Δt|T > t)


h (t) = lim .
Δt→0 Δt
The hazard function can be expressed in terms of the survival function and the
probability density function f (t)

f (t) S  (t) d
h (t) = =− = − log {S (t)} . (12.1)
S (t) S (t) dt

The hazard function is also known as the instantaneous failure rate. One of the
characteristics of survival data is the existence of incomplete data, the most common
type of which is left truncation and right censoring.

Definition 12.2. Left truncation occurs when subjects enter a study at a specific time
and are followed henceforth until the event occurs or the subject is censored. Right
censoring occurs when a subject leaves the study before the event occurs or the study
ends before the event has occurred.

In right censoring, the observations can be represented by a random vector (T, δ), where δ indicates whether the survival time is observed (δ = 1) or not (δ = 0).

12.1.1. Kaplan-Meier Estimator


The Kaplan-Meier estimator is a nonparametric estimate of the survival function using lifetime data. Suppose that there are n individuals in a cohort and that t_1 < t_2 < ... are the actual times of death. Let d_1, d_2, ... denote the numbers of deaths that occur at each of these times and let n_1, n_2, ... denote the corresponding numbers of patients at risk just before these times. Hence,

$$n_{i+1} = n_i - d_i, \quad i = 1, 2, \ldots$$


Suppose that t ∈ [t_j, t_{j+1}), j = 1, 2, .... Then the probability of surviving beyond time t is estimated by

$$\hat S(t) = \prod_{i=1}^{j} \left(1 - \frac{d_i}{n_i}\right). \qquad (12.2)$$

Equation (12.2) is known as the Kaplan-Meier estimator of the survival function. It is a step function. Confidence intervals can be constructed using Greenwood's formula for the standard error (Rodriguez, 2005).

Assuming no censored observations, it can be shown by induction that

$$\hat S(t) = \frac{n_{i+1}}{n}, \quad t \in [t_i, t_{i+1}), \; i = 0, 1, 2, \ldots$$

In the case of censored observations, we may assume that

$$n_{i+1} = n_i - d_i - c_i, \quad i = 0, 1, 2, \ldots,$$

where c_i is the number of observations censored between t_i and t_{i+1}.
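The estimator (12.2) is easy to compute directly. The following Python sketch is a minimal illustration written for this text (not the implementation of any particular package), and the toy data at the end are hypothetical:

import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier estimate of S(t) via the product (12.2).

    times  : observed times, censored or not
    events : 1 if the death/failure was observed, 0 if right censored
    Returns the distinct observed failure times t_i and S-hat at each t_i.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    fail_times = np.unique(times[events == 1])   # t_1 < t_2 < ...
    s, surv = 1.0, []
    for t in fail_times:
        n_i = np.sum(times >= t)                     # number at risk just before t
        d_i = np.sum((times == t) & (events == 1))   # deaths at t
        s *= 1.0 - d_i / n_i                         # factor (1 - d_i/n_i) of (12.2)
        surv.append(s)
    return fail_times, np.array(surv)

# hypothetical cohort: a 0 flags a right-censored observation
t = [3, 5, 5, 8, 10, 12]
e = [1, 1, 0, 1, 0, 1]
for ti, si in zip(*kaplan_meier(t, e)):
    print(f"S({ti:g}) = {si:.3f}")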

12.1.2. Locally Most Powerful Tests


We recall from Chapter 8 Hoeffding's change of measure formula

$$P\{R_1 = r_1, \ldots, R_m = r_m\} = E_g\!\left[\frac{f(V_{(r_1)})}{g(V_{(r_1)})} \cdots \frac{f(V_{(r_m)})}{g(V_{(r_m)})}\right] \Big/ \binom{n}{m}, \qquad (12.3)$$

where E_g denotes expectation with respect to the probability measure under which the n observations are i.i.d. with common density function g, assuming that g is positive whenever f is. In particular, consider testing H_0 : f = g versus the location alternative f(x) = g(x − θ) for small positive values of θ. In this case, differentiating both sides of (12.3) with respect to θ and letting θ ↓ 0 yields
$$\frac{\partial}{\partial \theta} P\{R_1 = r_1, \ldots, R_m = r_m\}\Big|_{\theta = 0} = -\,E_g\!\left[\sum_{i=1}^{m} \frac{g'(V_{(r_i)})}{g(V_{(r_i)})}\right] \Big/ \binom{n}{m}. \qquad (12.4)$$

Hence by an extension of the Neyman-Pearson lemma, the derivative of the power function at θ = 0 is maximized by a test that rejects H_0 when the right-hand side of (12.4) exceeds some threshold C, which is chosen so that the test has type I error α when θ = 0. This test, therefore, is locally most powerful for testing alternatives of the form f(x) = g(x − θ), with θ ↓ 0; examples include the Fisher-Yates test when g is standard normal and the Wilcoxon test when g(x) = e^x/(1 + e^x)^2 is the logistic density.


A parametric embedding argument similar to (7.3) can be used to give an alternative derivation of the local optimality of the Fisher-Yates and Wilcoxon tests. Define

$$\pi(x_{1j}, x_{2j}; \theta_1, \theta_2) = \exp\left\{\sum_{l=1}^{2} \theta_l^{\top} x_{lj} - K(\theta_1, \theta_2)\right\} p_{0j}, \quad j = 1, \ldots, t!, \qquad (12.5)$$


where θ_l = (θ_{l1}, ..., θ_{lk}) represents the parameter vector for sample l (l = 1, 2) and x_{1j}, x_{2j} are the data from sample 1 and sample 2, with respective sizes m and n − m, that are associated with the ranking (permutation) ν_j. Under the null hypothesis H_0 : θ_1 = θ_2, we can assume without loss of generality that the underlying V_1, ..., V_n from the combined sample are i.i.d. uniform (by considering G(V_i), where G is the common distribution function, assumed to be continuous, of the V_i) and that all rankings of the V_i are equally likely. Hence the above model represents an exponential family constructed by exponential tilting of the baseline measure (i.e., corresponding to H_0) on the rank-order data. This has the same spirit as Neyman's smooth test of the null hypothesis that the data are i.i.d. uniform against alternatives in the exponential family. The parametric embedding makes these results directly applicable to the rank-order statistics, as was discussed in Chapter 7. In particular, this shows that the two-sample Wilcoxon test of H_0 is locally most powerful for testing the uniform distribution against the truncated exponential distribution for which the x_j are constrained to lie in the range (0, 1) of the uniform distribution. Note that these exponential tilting alternatives differ from the location alternatives in the preceding paragraph not only in their distributional form (truncated exponential instead of logistic) but also in avoiding the strong assumption of the preceding paragraph that the data have to be generated from the logistic distribution even under the null hypothesis.
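To make the two locally most powerful score functions above concrete, the following Python sketch (our own illustration; the samples are simulated) computes the Wilcoxon statistic and the Fisher-Yates statistic in its van der Waerden form Φ^{-1}(R_i/(n+1)) from the ranks of sample 1 in the combined sample:

import numpy as np
from scipy.stats import norm, rankdata

def two_sample_scores(x, y):
    """Wilcoxon and Fisher-Yates (normal-scores) statistics for sample x
    computed from its ranks R_i within the combined sample (x, y)."""
    combined = np.concatenate([x, y])
    n = combined.size
    ranks = rankdata(combined)[:len(x)]             # R_i for the observations in x
    wilcoxon = ranks.sum()                          # locally most powerful: logistic g
    fisher_yates = norm.ppf(ranks / (n + 1)).sum()  # locally most powerful: normal g
    return wilcoxon, fisher_yates

rng = np.random.default_rng(1)
x = rng.normal(0.5, 1.0, size=15)   # shifted sample
y = rng.normal(0.0, 1.0, size=20)
print(two_sample_scores(x, y))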

12.2. Local Asymptotic Normality, Hajek-Le Cam Theory, and Stein's Least Favorable Parametric Submodels
The local alternatives in Section 12.1 refer to θ near the value(s) θ0 assumed by the null
hypothesis. The sample size n is not involved in the analysis of local power. On the
other hand, the central limit theorem has played a major role in the development of
rank tests, as asymptotic normality is used to provide approximate critical values under
the null hypothesis and to approximate the power function under alternatives within
O(n−1/2 ) from θ0 . Hajek (1962) applied Le Cam’s contiguity theory to rank tests of the


null hypothesis H0 : Δ = 0 in the simple regression model Yi = α + Δci + εi , in which


ε_i are i.i.d. with common density function f, using linear rank statistics of the form

$$S_n = \sum_{i=1}^{n} (c_i - \bar c)\, \varphi\!\left(\frac{R_i}{n+1}\right). \qquad (12.6)$$

He derived the asymptotic normality of S_n under the null hypothesis and contiguous alternatives, and showed the test to have asymptotically maximum power uniformly for these alternatives if φ = −(f′ ∘ F^{−1})/(f ∘ F^{−1}), where F is the distribution function with derivative f. Note that this result is consistent with the choice of the score function given by (12.4) for locally most powerful tests. Hajek (1968) subsequently introduced the projection method to extend these results to local alternatives that need not be contiguous to the null.
The rank tests in the preceding paragraph deal with the regression setting, which is related to the location alternatives in Section 12.1. If we focus on k-sample problems, then parametric embedding as in the second paragraph of that section applies, and the idea of local asymptotic normality (LAN), which was also introduced by Le Cam in conjunction with contiguity, can be used to derive the LAN property of the embedded family. As pointed out in Chapter 7 of van der Vaart (2007), a sequence of parametric models is LAN if asymptotically (as n → ∞) their likelihood ratio processes behave like those for the normal mean model via a quadratic expansion of the log-likelihood function. Hajek (1970, 1972) and Le Cam made use of the
LAN property to derive asymptotic optimality in parametric estimation and testing via
convolution theorems and local asymptotic minimax bounds; see van der Vaart (2007),
Chapter 8. The Hajek-Le Cam theory was originally introduced to resolve the long-
standing problem concerning the efficiency of the maximum likelihood estimator in a
parametric model. For the problem of estimating a location parameter or more general
regression parameters, there is a corresponding asymptotic minimax theory introduced
by Huber (1964, 1972, 1973, 1981) associated with robust estimators which consist of
three types: the maximum likelihood type M-estimators, the R-estimators which are
derived from the rank statistics, and the L-estimators which are linear combinations of
order statistics. See Chapters 5, 13, and 22 of van der Vaart (2007). Although “it is
customary to treat nonparametric statistical theory as a subject completely different
from parametric theory,” Stein developed the least favorable parametric subfamilies for
nonparametric testing and estimation as “one of the obvious connections between the
two subjects.” The implication of Stein’s idea on our parametric embedding theme is
the possibility of establishing full asymptotic efficiency of a nonparametric test by using
a “least favorable” parametric family of densities for parametric embedding (see Bickel
(1982)).


12.3. Parametric Embedding with Censored and Truncated Data
In this section we generalize parametric embedding to the much more complicated setting
of censored and truncated data, and use the generalization of parametric embedding to
revisit a number of major developments for these data. An extension of rank tests to
censored data began with Gehan’s (1965) extension of the Wilcoxon test and Mantel’s
logrank test (Mantel, 1966). An idea similar to Gehan’s was extended to truncated data
by Bhattacharya et al. (1983). Lai and Ying (1991, 1992) gave a unified treatment of
rank statistics for left-truncated and right-censored (LTRC) data. In Section 12.3.1 we
generalize the parametric embedding approach to censored data. Section 12.3.2 gives an
overview of the development of hazard-induced rank tests, highlighting the difficulties
caused by ranking incomplete data and describing important landmarks in overcoming
these difficulties. Coupled with the LAN and local minimaxity results of Section 12.3.3,
the approach introduced in Section 12.3.1 yields asymptotically optimal tests for local
alternatives in the embedded family.

12.3.1. Extension of Parametric Embedding to Censored Data


We begin with the right-censored case for which our basic idea of using the hazard
function instead of the density function for exponential tilting can become transparent.
Recall that for complete data V_1, ..., V_n, the parametric embedding

$$\pi(x_j; \theta) = \exp\{\theta^{\top} x_j - K(\theta)\}\, p_{0j}, \quad j = 1, \ldots, t!$$

for one sample, or

$$\pi(x_{1j}, x_{2j}; \theta_1, \theta_2) = \exp\left\{\sum_{l=1}^{2} \theta_l^{\top} x_{lj} - K(\theta_1, \theta_2)\right\} p_{0j}, \quad j = 1, \ldots, t!$$

for two samples, assumes (a) equally likely rankings that give rise to p_{0j} and i.i.d. uniform G(V_1), ..., G(V_n) under the null hypothesis, and (b) exponential tilting via distinct values of x_j or x_{lj} that are functions of the ranks. Here, θ_l = (θ_{l1}, ..., θ_{lk}) represents the parameter vector for sample l, and x_{1j}, x_{2j} are the data for samples 1 and 2 with respective sizes m and n − m. Under the null hypothesis H_0 : θ_1 = θ_2, we can assume without loss of generality that the underlying V_i are i.i.d. uniform. The V_i are not completely observable when the data are censored, so the observations are (Ṽ_i, δ_i), where Ṽ_i = min(V_i, c_i) and δ_i = I_{\{V_i \le c_i\}}. Since the rank assigned to V_i for complete data


is the empirical distribution function evaluated at V_i, the analog for censored data is Ĝ(Ṽ_i), where Ĝ is the Kaplan-Meier estimator, which is the nonparametric MLE of G for censored data. Hence the model under the null hypothesis is that of i.i.d. uniform random variables censored by G(c_i), providing a partial analog of (a). Since Ĝ puts all its mass at the uncensored observations (with δ = 1), this causes some difficulty in generalizing (b) because the sample also contains censored observations. We note that at each uncensored observation Ṽ_i, the information in the ordered sample conveys not only the value of V_i but also how many observations Ṽ_j in the sample are ≥ Ṽ_i. When the V_i denote failure times in survival analysis, this means the size of the risk set, that is, the number of subjects who are at risk at an observed failure time V_i. This resolves the inherent difficulty of ordering the censored observations, whose actual failure times are unknown except for their exceedance over c_i. To rank the data, we need a total order on the sample space, but the subset consisting of censored observations cannot be totally ordered because the underlying failure times are unknown. Using the observed failure time and the risk set size at each uncensored observation gives a partial analog of the ranking for complete data. To be at risk at an observed failure time V_i, the subject cannot fail prior to V_i. The jump in Ĝ at Ṽ_i basically measures the conditional probability of failing in an infinitesimal interval around Ṽ_i given that failure has not occurred prior to Ṽ_i. This means that we should think of hazard functions instead of density functions and perform exponential tilting using the hazard functions rather than the density functions.
Consider the two-sample problem with censored data. Let V_{(1)} < ··· < V_{(k)} denote the ordered uncensored observations of the combined sample, let N_j (resp. M_j) denote the number of observations in the combined sample (resp. in sample 1) that are ≥ V_{(j)}, and let u_j = 1 (resp. 0) if V_{(j)} comes from sample 1 (resp. sample 2). Note that {(1), ..., (k), M_1, N_1, ..., M_k, N_k} is invariant under the group of strictly increasing transformations that leaves the testing problem invariant. We now introduce embedding of the null model into a smooth parametric family that also contains alternatives. Instead of tilting the density functions as before, we define the change of measures via intensity (hazard) functions, as in Section II.7 of Andersen et al. (1993). Because the normalizing constant e^{−K(θ)} is canceled between the numerator and denominator, it does not appear in the likelihood ratio statistic. On the other hand, the denominator of (12.1) will induce a function λ_0(t), which can be chosen as the baseline (or null hypothesis) hazard function, in the likelihood ratio. The analog therefore for one sample takes the proportional hazards form

$$\pi(x_j; \theta, t) = \lambda_0(t) \exp(\theta^{\top} x_j). \qquad (12.7)$$
We discuss below the choice of xj that extends xj = X(νj ) to LTRC data, for which we
also define the hazard-induced rank statistics.


12.3.2. From Gehan and Bhattacharya et al. to Hazard-Induced Rank Tests
In this section we first focus on some landmark developments of two-sample rank tests
for censored data in the literature and then show how the xj can be chosen on the basis
of the insights provided by these developments. We next show how these two-sample
rank statistics can be extended to the k-sample and regression settings, and then further
extend them for left-truncated and LTRC data.
The first landmark development was Gehan's extension of the Mann-Whitney version of the Wilcoxon test to censored data. Let T_{1i} (resp. T_{2j}) denote the actual failure times of sample 1 (resp. sample 2), and (T̃_{1i}, δ_{1i}) and (T̃_{2j}, δ_{2j}) be the corresponding observations. For complete data, the Mann-Whitney statistic is

$$W = \sum_{i=1}^{m} \sum_{j=1}^{n-m} w(T_{1i}, T_{2j}),$$

where w(t_1, t_2) = 1 (resp. −1) if t_1 > t_2 (resp. t_1 < t_2), and w(t_1, t_2) = 0 if t_1 = t_2. For censored data, Gehan replaced w(T_{1i}, T_{2j}) by

$$w(\tilde T_{1i}, \delta_{1i}; \tilde T_{2j}, \delta_{2j}) = \begin{cases} -1 & \text{if } \tilde T_{1i} \le \tilde T_{2j} \text{ and } \delta_{1i} = 1, \\ 1 & \text{if } \tilde T_{1i} \ge \tilde T_{2j} \text{ and } \delta_{2j} = 1, \\ 0 & \text{otherwise,} \end{cases} \qquad (12.8)$$

noting that comparisons can be made if the smaller of T̃_{1i} and T̃_{2j} is uncensored. (In fact, Gehan introduced a further refinement depending on whether the larger observation is censored or not.) Breslow
(1970) subsequently extended this to the k-sample case and expressed W in the counting process form

$$W = \int Y_1(s)\, dN_2(s) - \int Y_2(s)\, dN_1(s), \qquad (12.9)$$

where $N_1(s) = \sum_{i=1}^{m} I_{\{\tilde T_{1i} \le s,\, \delta_{1i} = 1\}}$, $N_2(s) = \sum_{j=1}^{n-m} I_{\{\tilde T_{2j} \le s,\, \delta_{2j} = 1\}}$, and $Y_1(s) = \sum_{i=1}^{m} I_{\{\tilde T_{1i} \ge s\}}$ and $Y_2(s) = \sum_{j=1}^{n-m} I_{\{\tilde T_{2j} \ge s\}}$ are the corresponding risk set sizes.
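The pairwise weights (12.8) translate directly into code. The sketch below is our own illustration (tied pairs are scored 0, as in the complete-data weight w, and the data are hypothetical):

import numpy as np

def gehan_statistic(t1, d1, t2, d2):
    """Gehan's W = sum_i sum_j w(T1i, d1i; T2j, d2j) with the weights (12.8).
    A pair contributes only when the smaller observation is uncensored."""
    W = 0
    for ti, di in zip(t1, d1):
        for tj, dj in zip(t2, d2):
            if ti < tj and di == 1:      # T1i is known to precede T2j
                W -= 1
            elif ti > tj and dj == 1:    # T2j is known to precede T1i
                W += 1
    return W

# hypothetical samples; sample 2 tends to fail later
print(gehan_statistic([2, 4, 6], [1, 1, 0], [3, 5, 9, 11], [1, 0, 1, 1]))  # -5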
Instead of the weight processes Y_1 and Y_2 that depend on both failures and censoring, Prentice (1978) suggested that a better alternative should depend on the survival experience in the combined sample. For complete data the classical two-sample rank statistics have the form $S_n = \sum_{i=1}^{m} a_n(R_i)$, where the scores a_n(j) are obtained from a score function φ on (0, 1] by a_n(j) = φ(j/n), so that $S_n = \sum_{i=1}^{m} \varphi(G_n(T_i))$, where G_n is the empirical distribution function of the combined sample, or by some asymptotically equivalent variant such as the expected value of φ evaluated at the j-th uniform order statistic from a sample of size n. As pointed out, the counterpart of G_n(T_i) for censored data is Ĝ_n(T̃_i), where Ĝ_n is the Kaplan-Meier estimate based on the combined sample. If δ_i = 1, T̃_i is the actual failure time and has score φ(Ĝ_n(T̃_i)). On the other hand, if δ_i = 0, then the



actual failure time T_i is unknown, other than that it exceeds T̃_i, and therefore has score Φ(Ĝ_n(T̃_i)), where

$$\Phi(t) = \int_t^1 \varphi(v)\, dv \Big/ (1 - t), \quad 0 \le t < 1, \qquad (12.10)$$

represents the average of the scores φ(u) with u ≥ t. This leads to the following extension of the classical rank statistic $\sum_{i=1}^{m} \varphi(G_n(T_i))$ to censored data:

$$S_n^{*} = \sum_{i=1}^{m} \left\{ \delta_i\, \varphi(\hat G_n(\tilde T_i)) + (1 - \delta_i)\, \Phi(\hat G_n(\tilde T_i)) \right\}. \qquad (12.11)$$
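To illustrate (12.10) and (12.11), take the centred Wilcoxon score φ(u) = 2u − 1, for which (12.10) gives Φ(t) = t in closed form. The Python sketch below is our own illustration (the data are hypothetical); Ĝ_n is computed as one minus the Kaplan-Meier survival estimate of the combined sample:

import numpy as np

def km_cdf(times, events):
    """Ghat = 1 - Shat, the Kaplan-Meier estimate of the common
    distribution function, returned as a right-continuous step function."""
    times, events = np.asarray(times, float), np.asarray(events, int)
    fail_t = np.unique(times[events == 1])
    s, surv = 1.0, []
    for t in fail_t:
        d = np.sum((times == t) & (events == 1))
        s *= 1.0 - d / np.sum(times >= t)
        surv.append(s)
    surv = np.array(surv)
    def Ghat(t):
        idx = np.searchsorted(fail_t, t, side="right") - 1
        return 0.0 if idx < 0 else 1.0 - surv[idx]
    return Ghat

def prentice_statistic(t1, d1, t2, d2, phi, Phi):
    """S*_n of (12.11) for sample 1: score phi(Ghat) when uncensored,
    the averaged score Phi(Ghat) of (12.10) when censored."""
    Ghat = km_cdf(np.concatenate([t1, t2]), np.concatenate([d1, d2]))
    return sum(phi(Ghat(t)) if d == 1 else Phi(Ghat(t))
               for t, d in zip(t1, d1))

# Wilcoxon score: phi(u) = 2u - 1, and (12.10) then yields Phi(t) = t
print(prentice_statistic([2, 4, 6], [1, 1, 0], [3, 5, 9, 11], [1, 0, 1, 1],
                         phi=lambda u: 2 * u - 1, Phi=lambda u: u))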

Prentice (1978) conjectured the asymptotic equivalence of (12.11) to another class of rank statistics that he proposed for censored data based on the generalized rank vector, which is a permutation of {1, ..., n} of the form

$$R = \bigl((1), \ldots, (k); \{(i,1), \ldots, (i,\nu_i)\}_{i=0,\ldots,k}\bigr), \qquad (12.12)$$

where V_{(1)} < ··· < V_{(k)} are the ordered uncensored observations of the combined sample and $\{\tilde V_{(i,1)}, \ldots, \tilde V_{(i,\nu_i)}\}$ is the unordered set of censored observations between V_{(i)} and V_{(i+1)}, setting V_{(0)} = 0. Cuzick (1985) proved this conjecture under some smoothness
assumptions on φ and also extended the proof to show in his Section 3 the asymptotic equivalence of (12.11) and

$$S_n = \sum_{j=1}^{k} \psi\bigl(\hat G_n(V_{(j)})\bigr) \left( c_j - \frac{M_j}{N_j} \right), \quad \text{where } \psi = \varphi - \Phi. \qquad (12.13)$$

This form of rank statistics for censored data dates back to Mantel (1966) with ψ = 1. As shown by Gu et al. (1991), there is a one-to-one correspondence between φ and ψ:

$$\varphi(t) = \psi(t) - \int_0^t \frac{\psi(s)}{1 - s}\, ds, \quad 0 < t < 1,$$

and rank statistics of the form (12.13) can be expressed in the form of generalized Mann-Whitney statistics $W = \sum_{i=1}^{m} \sum_{j=1}^{n-m} w(\tilde T_{1i}, \delta_{1i}; \tilde T_{2j}, \delta_{2j})$ with

$$w(\tilde T_{1i}, \delta_{1i}; \tilde T_{2j}, \delta_{2j}) = \begin{cases} -n\,\psi(\hat G_n(\tilde T_{1i}))/Y_{\bullet}(\tilde T_{1i}) & \text{if } \tilde T_{1i} \le \tilde T_{2j} \text{ and } \delta_{1i} = 1, \\ n\,\psi(\hat G_n(\tilde T_{2j}))/Y_{\bullet}(\tilde T_{2j}) & \text{if } \tilde T_{1i} \ge \tilde T_{2j} \text{ and } \delta_{2j} = 1, \\ 0 & \text{otherwise,} \end{cases} \qquad (12.14)$$

where $Y_{\bullet}(s) = \sum_{i=1}^{m} I_{\{\tilde T_{1i} \ge s\}} + \sum_{j=1}^{n-m} I_{\{\tilde T_{2j} \ge s\}}$ is the risk set size of the combined sample at s.
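Taking ψ ≡ 1 in (12.13) recovers Mantel's logrank statistic. A direct Python sketch (our own illustration, with c_j = u_j, the indicator that the j-th ordered uncensored failure comes from sample 1):

import numpy as np

def logrank_statistic(t1, d1, t2, d2):
    """(12.13) with psi = 1: sum over ordered uncensored failures of
    u_j - M_j/N_j, where N_j (M_j) is the size of the risk set in the
    combined sample (in sample 1)."""
    times = np.concatenate([t1, t2])
    events = np.concatenate([d1, d2])
    group = np.concatenate([np.ones(len(t1)), np.zeros(len(t2))])
    S = 0.0
    for j in np.flatnonzero(events == 1):
        Nj = np.sum(times >= times[j])                   # combined risk set
        Mj = np.sum((times >= times[j]) & (group == 1))  # sample-1 risk set
        S += group[j] - Mj / Nj
    return S

print(logrank_statistic([2, 4, 6], [1, 1, 0], [3, 5, 9, 11], [1, 0, 1, 1]))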


The representation (12.13) is convenient for extensions from the two-sample setting to the regression setting in which the c_j are the covariates in the regression model V_i = Δc_i + ε_i, as in (12.6). The M_j/N_j in (12.13) is now generalized to

$$\bar c_j = \sum_{i=1}^{n} c_i\, I_{\{\tilde V_i \ge \tilde V_{(j)}\}} \Big/ N_j, \qquad (12.15)$$

which is the average value of the covariate over the risk set at the uncensored observation Ṽ_{(j)}. Lai and Ying (1991, Theorem 1) established the asymptotic normality of these rank statistics under the null hypothesis H_0 : Δ = 0 and under local alternatives. Analogous to the complete-data case, these tests are asymptotically efficient when ψ = −(λ′ ∘ F^{−1})/(λ ∘ F^{−1}), where F is the common distribution function and λ the hazard function of the ε_i. They proved this result when the data can also be subject to left truncation.
Suppose (c_i, V_i, δ_i) can be observed only when Ṽ_i = min(V_i, ξ_i) ≥ τ_i, where (τ_i, ξ_i, c_i) are independent random vectors that are independent of the ε_i. The τ_i are left truncation variables and V_i is also subject to right censoring by ξ_i. The case ξ_i ≡ ∞ corresponds to the left-truncated model, for which multiplication of V_i and τ_i by −1 converts it into a right-truncated model. Motivated by a controversy in cosmology involving Hubble's Law and Chronometric Theory, Bhattacharya et al. (1983) introduced a Mann-Whitney-type statistic $W_n(\Delta) = \sum_{i \ne j} w_{ij}(\Delta)$ in the regression model V_i = Δc_i + ε_i, in which c_i represents log velocity and V_i the negative log of luminosity; moreover, (c_i, V_i) can only be observed if V_i ≤ v_0. This is a right-truncated model with truncation variables τ_i ≡ v_0, and letting (V_i^*, c_i^*), i = 1, ..., n, denote the observations, they defined e_i(Δ) = V_i^* − Δc_i^* and

$$w_{ij}(\Delta) = \begin{cases} c_i^* - c_j^* & \text{if } e_j(\Delta) < e_i(\Delta) \le v_0 - \Delta c_j^*, \\ c_j^* - c_i^* & \text{if } e_i(\Delta) < e_j(\Delta) \le v_0 - \Delta c_i^*, \\ 0 & \text{otherwise,} \end{cases} \qquad (12.16)$$

since it is impossible to compare e_i(Δ) and e_j(Δ) if

$$e_i(\Delta) > v_0 - \Delta c_j^* \quad \text{or} \quad e_j(\Delta) > v_0 - \Delta c_i^*.$$

Note the similarity of this idea to that proposed by Gehan for censored data; again it has the same drawbacks as before. In fact, as shown by Lai and Ying (1991), what we discussed in the preceding paragraph for censored data can be readily extended to LTRC data (u_i^*, Ṽ_i^*, δ_i^*), i = 1, ..., n, that are generated from the larger sample consisting of (V_i, c_i), i = 1, ..., m(n), where

$$m(n) = \inf\Bigl\{ m : \sum_{i=1}^{m} I_{\{\tau_i \le \min(V_i, c_i)\}} = n \Bigr\},$$

with (Ṽ_i, δ_i) observable only when Ṽ_i ≥ τ_i. The risk set size at t in this case is $Y(t) = \sum_{i=1}^{m(n)} I_{\{\tau_i - \Delta c_i \le t \le \tilde V_i - \Delta c_i\}}$ and


the nonparametric MLE of the common distribution function G of the ε_i is the product-limit estimator

$$\hat G_n(t) = 1 - \prod_{s \le t} \left(1 - \frac{N(s) - N(s-)}{Y(s)}\right),$$

where $N(s) = \sum_{i=1}^{m(n)} I_{\{\tau_i - \Delta c_i \le \tilde V_i - \Delta c_i \le s,\, \delta_i = 1\}}$ when the value of Δ is specified (e.g., Δ = 0 under the null hypothesis). The counting process N(s) plays a fundamental role in the martingale theory underlying the analysis of rank tests via N(s) and Y(s) by Aalen (1978), Gill (1980), and Andersen et al. (1993, Chapter 5) for censored data, and by Lai and Ying (1991, 1992) for LTRC data.
As pointed out in Section 12.3.1, the parametric embedding associated with these regression models is that of a location-shift family. Parametric embedding via exponential tilting as in (12.7) is associated with another kind of regression model, called hazard regression models, which model how the hazard functions (rather than the means) of the V_i vary with the covariates u_i. Seminal contributions to this problem were made by Cox (1972), who introduced the model (12.7) for censored survival data. Kalbfleisch and Prentice (1973) derived the marginal likelihood L(θ) of the rank vector R for this model:

$$L(\theta) = \prod_{j=1}^{k} \left\{ e^{\theta^{\top} x_{(j)}} \Big/ \left( \sum_{i \in I_j} e^{\theta^{\top} x_{(i)}} \right) \right\}, \qquad (12.17)$$

where $I_j = \{ i : \tilde V_i \ge \tilde V_{(j)} \}$ is the risk set at the ordered uncensored observation Ṽ_{(j)}; this is the same as the likelihood given by Cox using conditional arguments and later by Cox (1975) using partial likelihood. It can be readily extended to LTRC data by redefining the risk set at Ṽ_{(j)} as $\{ i : \tilde V_i \ge \tilde V_{(j)} \ge \tau_i \}$. Basically, the regression model in the preceding paragraph considers the residuals Ṽ_i − Δc_i, whereas for hazard regression we consider the Ṽ_i themselves.
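To make (12.17) concrete, the following Python sketch (our own illustration, with a hypothetical scalar covariate) evaluates log L(θ); ties among failure times are handled in the obvious way, and the same loop serves LTRC data once the risk sets I_j are redefined as above:

import numpy as np

def cox_log_marginal_likelihood(theta, times, events, x):
    """log L(theta) of (12.17): at each ordered uncensored time, the failing
    subject's exponent minus the log-sum of exp(theta * x_i) over the risk set."""
    times, events, x = map(np.asarray, (times, events, x))
    ll = 0.0
    for j in np.argsort(times):
        if events[j] == 1:
            risk = x[times >= times[j]]   # I_j = {i : V_i >= V_(j)}
            ll += theta * x[j] - np.log(np.sum(np.exp(theta * risk)))
    return ll

t = [2, 3, 4, 5, 6, 9, 11]
d = [1, 1, 1, 0, 0, 1, 1]
z = [0, 1, 0, 1, 0, 1, 1]   # hypothetical treatment indicator
print(cox_log_marginal_likelihood(0.5, t, d, z))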

12.3.3. Semi-parametric Efficiency via Least Favorable Parametric Submodels
The LAN property for the embedded families (exponential tilting and location shifts) associated with rank tests for complete data can be extended to those for LTRC data discussed in the preceding two sections; see Chapter 8 of Andersen et al. (1993) for censored data and Lai and Ying (1992) for LTRC data in the regression setting. For the embedded family (12.7), the well-known arguments for Cox regression extend readily to LTRC data if the x_i in (12.7) are the vectors of covariates u_i. For the two-sample problem in which x_i depends on the generalized rank vector, we can choose x_j = ψ(Ĝ_n(Ṽ_{(j)})) to devise an asymptotically efficient rank statistic as in the censored case, where ψ = φ − Φ.


The asymptotic efficiency of the rank tests depends on the class of alternatives in the
embedded parametric family, which may not contain the actual alternative.
The problem of finding the parametric family that gives the best asymptotic minimax
bound has been an active area of research since the seminal paper of Stein that describes
a basic idea inherently related to the theme of this chapter:

Clearly a nonparametric problem is at least as difficult as any of the parametric problems obtained by assuming we have enough knowledge of the unknown state of nature to restrict it to a finite-dimensional set. For a problem in which one wants to estimate a single real-valued function of the unknown state of nature it frequently happens that there is, through each state of nature, a one-dimensional problem which is, for large samples, at least as difficult (to a first approximation) as any other finite-dimensional problem at that point. If a procedure does essentially as well, for large samples, as one could do for each such one-dimensional problem, one is justified in considering the procedure efficient for large samples.

The implication of Stein's idea for our parametric embedding theme is the possibility of establishing full asymptotic efficiency of a nonparametric/semi-parametric test by using a "least favorable" parametric family of densities for parametric embedding. Lai and Ying (1992, Section 2) have shown how this can be done for regression models with i.i.d. additive noise ε_i. The least favorable parametric family has hazard functions of the form λ(t) + θη(t), where η is an approximation to −λ′Γ_1/Γ_0, λ is the hazard function of the ε_i, and it is assumed that for h = 0, 1, 2,

$$\Gamma_h(s) = \lim_{m \to \infty} m^{-1} \sum_{i=1}^{m} E\bigl\{ c_i^h\, I_{\{\tau_i - \Delta c_i \le s \le c_i - \Delta c_i\}} \big/ \bigl(1 - F(\tau_i - \Delta c_i)\bigr) \bigr\}$$

exists for every s with F (s) < 1, where F is the distribution function of εi . In particular,
the technical details underlying the approximation are given in (2.26 a, b, c) of that pa-
per. Lai and Ying (1991, 1992) have also shown how these semi-parametric information
bounds can be attained by using a score function that incorporates adaptive estimation
of λ. For a comprehensive overview of semi-parametric efficiency and adaptive estima-
tion in other contexts, see Bickel et al. (1993). For further details on the analysis of
censored data, see Alvo et al. (2018).

Appendices

A. Description of Data Sets

A.1. Sutton Leisure Time Data


Ranks on                                    Number of        Number of
A: Males   B: Females   C: Both Sexes      white females    black females
   1            2             3                   0                1
   1            3             2                   0                1
   2            1             3                   1                0
   2            3             1                   0                5
   3            1             2                   7                0
   3            2             1                   6                6

A.2. Umbrella Alternative Data


The Wechsler Adult Intelligence Scale scores for males by age group.

Age Group
16–19 20–34 35–54 55–69 >70
8.62 9.85 9.98 9.12 4.80
9.94 10.43 10.69 9.89 9.18
10.06 11.31 11.40 10.57 9.27

A.3. Song Data


The Song data (t = 5) are described in Critchlow et al. (1991). Ninety-eight students were asked to rank 5 words, (1) score, (2) instrument, (3) solo, (4) benediction, and (5) suit, according to their association with the word "song." Critchlow et al. (1991) reported that the average ranks for words (1) to (5) are 2.72, 2.27, 1.60, 3.71, and 4.69, respectively. However, the available data are given in grouped format and the rankings of 15 students are unknown and hence discarded, resulting in 83 rankings, as shown below.


Rankings Observed frequency


(32145) 19
(23145) 10
(13245) 9
(42135) 8
(12345) 7
(31245) 6
(32154) 6
(52134) 5
(21345) 4
(24135) 3
(41235) 2
(43125) 2
(52143) 2
others 0

A.4. Goldberg Data


This data set is due to Goldberg (1976) (t = 10). In the data, 143 graduates were asked to rank 10 occupations according to the degree of social prestige. These 10 occupa-
tions are: (i) Faculty member in an academic institution (Fac), (ii) Mechanical engineer
(ME), (iii) Operation researcher (OR), (iv) Technician (Tech), (v) Section supervisor
in a factory (Sup), (vi) Owner of a company employing more than 100 workers (Own),
(vii) Factory foreman (For), (viii) Industrial engineer (IE), (ix) Manager of a production
department employing more than 100 workers (Mgr), and (x) Applied scientist (Sci).
The data are given in Cohen and Mallows (1980) and have been analyzed by many
researchers.
Feigin and Cohen (1978) analyzed the Goldberg data and found three outliers due to
the fact that the corresponding graduates wrongly presented rankings in reverse order.
After reversing these 3 rankings, the average ranks received by the 10 occupations are
8.57, 4.90, 6.29, 1.90, 4.34, 8.13, 1.47, 6.27, 5.29, 7.85, with the convention that bigger
rank means more prestige. Then the preference of graduates is in the order: Fac > Own
> Sci > OR > IE > Mgr > ME > Sup > Tech > For.

A.5. Sushi Data Set


We first investigate the two data sets of Kamishima (2003) for finding the difference in food preference patterns between eastern and western Japan. Historically, western Japan has been mainly affected by the culture of the Mikado emperor and nobles, while eastern Japan has been the home of the Shogun and Samurai warriors. Therefore, the preference patterns in food are different between these two regions (Kamishima, 2003).

The first data set is a complete ranking data set with t = 10: 5000 respondents were asked to rank 10 different kinds of sushi according to their preference. The region of each respondent is also recorded (N = 3285 for eastern Japan, 1715 for western Japan).

A.6. APA Data


We revisit the well-known APA data set of Diaconis (1988b) which contains 5738 full
rankings resulting from the American Psychological Association (APA) presidential elec-
tion of 1980. For this election, members of APA had to rank five candidates in order of
preference. Candidates A and C were research psychologists, candidates D and E were
clinical psychologists, and candidate B was a community psychologist. This data set has
been studied by Diaconis (1988b) and Kidwell et al. (2008), who found that the voting population was divided into 3 clusters. We also fit the data using our mixture model stated in Section 11.4.2. See also Xu et al. (2018).


A.7. January Precipitation Data (in mm) for Saint John, New Brunswick, Canada
Year Precipitation Year Precipitation Year Precipitation
1894 102.9 1927 117.6 1960 167.4
1895 112.3 1928 126.5 1961 82.3
1896 29.0 1929 119.4 1962 119.6
1897 100.6 1930 89.2 1963 169.2
1898 129.0 1931 112.5 1964 127.8
1899 95.5 1932 145.8 1965 87.9
1900 165.9 1933 112.8 1966 83.3
1901 121.9 1934 134.6 1967 88.9
1902 55.1 1935 300.0 1968 149.6
1903 101.3 1936 135.1 1969 115.8
1904 110.7 1937 138.7 1970 26.2
1905 142.0 1938 100.3 1971 82.8
1906 111.3 1939 85.3 1972 132.3
1907 83.1 1940 38.9 1973 153.4
1908 108.7 1941 57.2 1974 103.1
1909 144.8 1942 112.5 1975 168.9
1910 130.0 1943 28.2 1976 205.5
1911 81.3 1944 40.4 1977 102.5
1912 77.5 1945 127.3 1978 283.2
1913 108.7 1946 82.1 1979 296.2
1914 73.2 1947 110.4 1980 50.2
1915 147.1 1948 154.9 1981 134.3
1916 71.4 1949 156.7 1982 196.3
1917 109.7 1950 142.7 1983 75.3
1918 94.5 1951 161.3 1984 141.6
1919 142.5 1952 300.2 1985 44.0
1920 87.1 1953 141.0 1986 135.4
1921 80.3 1954 197.9 1987 135.0
1922 85.3 1955 156.0 1988 95.7
1923 141.7 1956 203.5 1989 101.0
1924 133.9 1957 152.7 1990 162.1
1925 119.9 1958 221.5 1991 102.2
1926 156.0 1959 147.3


A.8. Annual Temperature Data (in ◦C) in Hong Kong


Year max mean min Year max mean min
1961 34.2 22.9 7.3 1989 34.3 23 7.6
1962 35.5 22.7 6 1990 36.1 23.1 7
1963 35.6 23.3 7.1 1991 34.5 23.5 4.6
1964 33.9 22.9 7 1992 35 22.8 8.4
1965 33.4 23.1 7.3 1993 33.5 23.1 5.4
1966 34.7 23.8 7.5 1994 34.1 23.6 7.9
1967 34.4 22.9 4.6 1995 34.2 22.8 9.2
1968 35.7 22.9 5.7 1996 34.3 23.3 5.8
1969 34.7 22.7 4 1997 33.2 23.3 10.2
1970 33.6 22.8 7.6 1998 34.4 24 8.9
1971 33.7 22.7 5.5 1999 35.1 23.8 5.8
1972 34.7 22.8 3.8 2000 34.2 23.3 7.2
1973 33.1 23.3 7 2001 34 23.6 8.9
1974 34.3 22.8 4.2 2002 33.6 23.9 6.8
1975 33.9 22.8 4.3 2003 33.7 23.6 8.8
1976 35.2 22.5 5.7 2004 34.6 23.4 7.7
1977 34.9 23.3 6.2 2005 35.4 23.3 6.4
1978 34.2 22.8 6.9 2006 34 23.5 8
1979 33.8 23.1 6.1 2007 35.3 23.7 10.6
1980 35 23 5.5 2008 34.6 23.1 7.9
1981 33.3 23.1 9.5 2009 34.9 23.5 9.4
1982 34.8 22.9 8.9 2010 34.1 23.2 5.8
1983 33.9 23 6.1 2011 35 23 7.2
1984 34.4 22.5 7 2012 34.5 23.4 7.1
1985 33 22.6 8.8 2013 34.9 23.3 9.2
1986 34.8 22.8 4.8 2014 34.6 23.5 7.3
1987 34.2 23.4 7.6 2015 36.3 24.2 10.3
1988 33.8 22.8 9.5 2016 35.6 23.6 3.1

Bibliography
Aalen, O. (1978). Nonparametric estimation of partial transition probabilities in multiple
decrement models. Ann. Statist., 6(3):534–545.

Adkins, L. and Fligner, M. (1998). A non-iterative procedure for maximum likelihood


estimation of the parameters of Mallows’ model based on partial rankings. Commu-
nications in Statistics: Theory and Methods, 27(9):2199–2220.

Albert, J. (2008). Bayesian Computation with R. Springer, second edition.

Alvo, M. (2008). Nonparametric tests of hypotheses for umbrella alternatives. Canadian


Journal of Statistics, 36:143–156.

Alvo, M. (2016). Bridging the gap: a likelihood function approach for the analysis of ranking data. Communications in Statistics - Theory and Methods, Series A, 45:5835–5847.

Alvo, M. and Berthelot, M.-P. (2012). Nonparametric tests of trend for proportions.
International Journal of Statistics and Probability, 1:92–104.

Alvo, M. and Cabilio, P. (1991). On the balanced incomplete block design for rankings.
The Annals of Statistics, 19:1597–1613.

Alvo, M. and Cabilio, P. (1994). Rank test of trend when data are incomplete. Envi-
ronmetrics, 5:21–27.

Alvo, M. and Cabilio, P. (1995). Testing ordered alternatives in the presence of incom-
plete data. Journal of the American Statistical Association, 90:1015–1024.

Alvo, M. and Cabilio, P. (1998). Applications of Hamming distance to the analysis


of block data. In Szyszkowicz, B., editor, Asymptotic Methods in Probability and
Statistics: A Volume in Honour of Miklós Csörgõ, pages 787–799. Elsevier Science,
Amsterdam.

Alvo, M. and Cabilio, P. (1999). A general rank based approach to the analysis of block
data. Communications in Statistics: Theory and Methods, 28:197–215.


Alvo, M. and Cabilio, P. (2000). Calculation of hypergeometric probabilities using


Chebyshev polynomials. The American Statistician, 54:141–144.

Alvo, M. and Cabilio, P. (2005). General scores statistics on ranks in the analysis of
unbalanced designs. The Canadian Journal of Statistics, 33:115–129.

Alvo, M., Cabilio, P., and Feigin, P. (1982). Asymptotic theory for measures of concor-
dance with special reference to Kendall’s tau. The Annals of Statistics, 10:1269–1276.

Alvo, M., Lai, T. L., and Yu, P. L. H. (2018). Parametric embedding of nonparametric
inference problems. Journal of Statistical Theory and Practice, 12(1):151–164.

Alvo, M. and Pan, J. (1997). A general theory of hypothesis testing based on rankings.
Journal of Statistical Planning and Inference, 61:219–248.

Alvo, M. and Xu, H. (2017). The analysis of ranking data using score functions and
penalized likelihood. Austrian Journal of Statistics, 46:15–32.

Alvo, M. and Yu, P. L. H. (2014). Statistical Methods for Ranking Data. Springer.

Alvo, M., Yu, P. L. H., and Xu, H. (2017). A semi-parametric approach to the multiple
change-point problem. Working paper, The University of Hong Kong.

Andersen, P., Borgan, O., Gill, R., and Keiding, N. (1993). Statistical Models Based on
Counting Processes. Springer: New York.

Anderson, R. (1959). Use of contingency tables in the analysis of consumer preference


studies. Biometrics, 15:582–590.

Ansari, A. R. and Bradley, R. A. (1960). Rank-sum tests for dispersions. Annals of


Mathematical Statistics, 31(4):1174–1189.

Asmussen, S., Jensen, J., and Rojas-Nandayapa, L. (2016). Exponential family tech-
niques for the lognormal left tail. Scandinavian Journal of Statistics, 43:774–787.

Bai, J. and Perron, P. (2003). Computation and analysis of multiple structural change
models. Journal of Applied Econometrics, 18(1):1–22.

Banerjee, A., Dhillon, I. S., Ghosh, J., and Sra, S. (2005). Clustering on the unit hyper-
sphere using von Mises-Fisher distributions. Journal of Machine Learning Research,
6(Sep):1345–1382.

Beckett, L. A. (1993). Maximum likelihood estimation in Mallows’ model using partially


ranked data. In Fligner, M. A. and Verducci, J. S., editors, Probability Models and
Statistical Analyses for Ranking Data, pages 92–108. Springer-Verlag.


Bhattacharya, P. K., Chernoff, H., and Yang, S. S. (1983). Nonparametric estimation of


the slope of a truncated regression. Ann. Statist., 11(2):505–514.

Bhattacharya, R., Lin, L., and Patrangenaru, V. (2016). A Course in Mathematical Statistics and Large Sample Theory. Springer.

Bickel, P. J. (1982). On adaptive estimation. Annals of Statistics, 10(3):647–671.

Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1993). Efficient and
Adaptive Estimation for Semiparametric Models. The Johns Hopkins University Press.

Billingsley, P. (2012). Probability and Measure. John Wiley and Sons, anniversary
edition.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review
for statisticians. Journal of the American Statistical Association, 112(518):859–877.

Box, G. E. P. and Tiao, G. C. (1973). Bayesian Inference in Statistical Analysis. Addison-Wesley Publishing Company.

Box, G. and Cox, D. (1964). An analysis of transformations. Journal of the Royal Statistical Society Series B, 26:211–252.

Breslow, N. (1970). A generalized Kruskal-Wallis test for comparing K samples subject


to unequal pattern of censorship. Biometrika, 57:579–594.

Busse, L. M., Orbanz, P., and Buhmann, J. M. (2007). Cluster analysis of heterogeneous
rank data. In Proceedings of the 24th International Conference on Machine Learning,
pages 113–120.

Cabilio, P. and Peng, J. (2008). Multiple rank-based testing for ordered alternatives
with incomplete data. Statistics and Probability Letters, 78:2609–2613.

Casella, G. and Berger, R. L. (2002). Statistical Inference. Duxbury Press., second


edition.

Casella, G. and George, E. I. (1992). Explaining the Gibbs sampler. The American
Statistician, 46:167–174.

Cohen, A. and Mallows, C. (1980). Analysis of ranking data. Technical memorandum,


AT&T Bell Laboratories, Murray Hill, N.J.

Conover, W. J. and Iman, R. L. (1981). Rank transformations as a bridge between


parametric and nonparametric statistics. The American Statistician, 35(3):124–129.

Cox, D. and Hinkley, D. (1974). Theoretical Statistics. Chapman Hall, London.


Cox, D. R. (1972). Regression models and life-tables. J. Roy. Statist. Soc. Ser. B.,
34(2):187–220.

Cox, D. R. (1975). Partial likelihood. Biometrika, 62(2):269–276.

Critchlow, D. (1985). Metric Methods for Analyzing Partially Ranked Data. Springer-
Verlag: New York.

Critchlow, D. (1986). A unified approach to constructing nonparametric rank tests.


Technical Report 86–15, Department of Statistics, Purdue University.

Critchlow, D. (1992). On rank statistics: An approach via metrics on the permutation


group. Journal of Statistical Planning and Inference, 32(325–346).

Critchlow, D. and Verducci, J. (1992). Detecting a trend in paired rankings. Applied


Statistics, 41:17–29.

Critchlow, D. E., Fligner, M. A., and Verducci, J. S. (1991). Probability models on


rankings. Journal of Mathematical Psychology, 35:294–318.

Cuzick, J. (1985). Asymptotic properties of censored linear rank tests. Ann. Statist.,
13(1):133–141.

Daniel, W. W. (1990). Applied Nonparametric Statistics. Duxbury, Wadsworth Inc.,


second edition.

Diaconis, P. (1988a). Group Representations in Probability and Statistics. Institute of


Mathematical Statistics, Hayward.

Diaconis, P. (1988b). Group representations in probability and statistics. Lecture Notes-


Monograph Series, 11:i–192.

Diaconis, P. and Graham, R. (1977). Spearman’s footrule as a measure of disarray.


Journal of the Royal Statistical Society Series B, 39:262–268.

Donoho, D. L. and Johnstone, I. M. (1995). Adapting to unknown smoothness via


wavelet shrinkage. Journal of the American Statistical Association, 90(432):1200–1224.

Efron, B. (1981). Nonparametric standard errors and confidence intervals. The Canadian
Journal of Statistics, 9(2):139–158.

Feigin, P. D. (1993). Modelling and analysing paired ranking data. In Fligner, M. A.


and Verducci, J. S., editors, Probability Models and Statistical Analyses for Ranking
Data, pages 75–91. Springer-Verlag.


Feigin, P. D. and Alvo, M. (1986). Intergroup diversity and concordance for ranking
data: an approach via metrics for permutations. The Annals of Statistics, 14:691–707.

Feigin, P. D. and Cohen, A. (1978). On a model for concordance between judges. Journal
of the Royal Statistical Society Series B, 40:203–213.

Feller, W. (1968). An Introduction to Probability Theory and Its Applications, volume I.


John Wiley & Sons, Inc, New York, third edition.

Ferguson, T. (1967). Mathematical Statistics: A Decision Theoretic Approach. Academic


Press, New York and London.

Ferguson, T. (1996). A Course in Large Sample Theory. John Wiley and Sons.

Fisher, R. (1935). The Design of Experiments. Oliver & Boyd, Edinburgh.

Fligner, M. A. and Verducci, J. S. (1986). Distance based ranking models. Journal of


the Royal Statistical Society Series B, 48(3):359–369.

Fraser, D. (1957). Non Parametric Methods in Statistics. John Wiley and Sons., New
York.

Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in
the analysis of variance. Journal of the American Statistical Association, 32:675–701.

Gao, X. and Alvo, M. (2005a). A nonparametric test for interaction in two-way layouts.
The Canadian Journal of Statistics, 33:1–15.

Gao, X. and Alvo, M. (2005b). A unified nonparametric approach for unbalanced fac-
torial designs. Journal of the American Statistical Association, 100:926–941.

Gao, X. and Alvo, M. (2008). Nonparametric multiple comparison procedures for unbal-
anced two-way layouts. Journal of Statistical Planning and Inference, 138:3674–3686.

Gao, X., Alvo, M., Chen, J., and Li, G. (2008). Nonparametric multiple comparison
procedures for unbalanced one-way factorial designs. Journal of Statistical Planning
and Inference, 138:2574–2591.

Garcia, R. and Perron, P. (1996). An analysis of the real interest rate under regime
shifts. The Review of Economics and Statistics, pages 111–125.

Gehan, E. A. (1965). A generalized Wilcoxon test for comparing arbitrarily singly-


censored samples. Biometrika, 52(1/2):203–223.


Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the
Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 6:721–741.

Gibbons, J. D. and Chakraborti, S. (2011). Nonparametric Statistical Inference. Chap-


man Hall, New York, 5th edition.

Gill, R. D. (1980). Censoring and Stochastic Integrals. Mathematical Centre, Amster-


dam.

Götze, F. (1987). Approximations for multivariate U statistics. Journal of Multivariate


Analysis, 22:212–229.

Gu, M. G., Lai, T. L., and Lan, K. K. G. (1991). Rank tests based on censored data
and their sequential analogues. Amer. J. Math. & Management Sci., 11(1–2):147–176.

Hajek, J. (1962). Asymptotically most powerful rank-order tests. Ann. Math. Statist.,
33(3):1124–1147.

Hajek, J. (1968). Asymptotic normality of simple linear rank statistics under alterna-
tives. Ann. Math. Statist., 39:325–346.

Hajek, J. (1970). A characterization of limiting distributions of regular estimates. Z.


fur Wahrsch. und Verw. Gebiete, 14:323–330.

Hajek, J. (1972). Local asymptotic minimax and admissibility in estimation. In Le Cam, L., Neyman, J., and Scott, E. M., editors, Proc. Sixth Berkeley Symp. Math. Statist. Prob., volume 1, pages 175–194. University of California Press, Berkeley.

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their
applications. Biometrika, 57:97–109.

Hettmansperger, T. P. (1994). Statistical Inference Based on Ranks. John Wiley.

Hinkley, D. (1970). Inference about the change-point in a sequence of random variables.


Biometrika, 57:1–17.

Hájek, J. and Sidak, Z. (1967). Theory of Rank Tests. Academic Press, New York.

Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution.


Annals of Mathematical Statistics, 19:293–325.

Hoeffding, W. (1951a). A combinatorial central limit theorem. Annals of Mathematical


Statistics, 22:558–566.


Hoeffding, W. (1951b). Optimum non-parametric tests. In Proceedings of the Second


Berkeley Symposium on Mathematical Statistics and Probability, pages 83–92, Berke-
ley, Calif. University of California Press.

Huber, P. J. (1964). Robust estimation of a location parameter. Annals of Mathematical


Statistics, 35(1):73–101.

Huber, P. J. (1972). The 1972 Wald lecture robust statistics: A review. Annals of
Mathematical Statistics, 43(4):1041–1067.

Huber, P. J. (1973). Robust regression: Asymptotics, conjectures and Monte Carlo.


Annals of Statistics, 1(5):799–821.

Huber, P. J. (1981). Robust statistics. Wiley, New York.

Jarque, C. and Bera, A. (1987). A test of normality of observations and regression


residuals. International Statistical Review, 55(2):163–172.

Jin, W. R., Riley, R. M., Wolfinger, R. D., White, K. P., Passador-Gundel, G., and Gib-
son, G. (2001). The contribution of sex, genotype and age to transcriptional variance
in drosophila melanogaster. Nature Genetics, 29:389–395.

John, J. and Williams, E. (1995). Cyclic Designs. Chapman Hall, New York.

Kalbfleisch, J. (1978). Likelihood methods and nonparametric tests. Journal of the


American Statistical Association, 73:167–170.

Kalbfleisch, J. D. and Prentice, R. L. (1973). Marginal likelihoods based on Cox's regression and life model. Biometrika, 60(2):267–278.

Kamishima, T. (2003). Nantonac collaborative filtering: recommendation based on


order responses. In Proceedings of the ninth ACM SIGKDD international conference
on Knowledge discovery and data mining, pages 583–588. ACM.

Kannemann, K. (1976). An incidence test for k related samples. Biometrische Zeitschrift,


18:3–11.

Kendall, M. and Stuart, A. (1979). The Advanced Theory of Statistics, volume 2. Griffin,
London, fourth edition.

Kidwell, P., Lebanon, G., and Cleveland, W. S. (2008). Visualizing incomplete and
partially ranked data. IEEE Transactions on Visualization and Computer Graphics,
14(6):1356–1363.


Killick, R., Fearnhead, P., and Eckley, I. (2012). Optimal detection of changepoints
with a linear computational cost. Journal of the American Statistical Association,
107(500):1590–1598.

Kruskal, W. H. (1952). A nonparametric test for the several sample problem. Annals of
Mathematical Statistics, 23:525–540.

Kruskal, W. H. and Wallis, W. A. (1952). Use of ranks in one-criterion variance analysis.


Journal of the American Statistical Association, 47(260):583–621.

Lai, T. L. and Ying, Z. (1991). Rank regression methods for left-truncated and right-
censored data. Ann. Statist., 19(2):531–556.

Lai, T. L. and Ying, Z. (1992). Asymptotically efficient estimation in censored and


truncated regression models. Statistica Sinica, 2(1):17–46.

Lancaster, H. (1953). A reconciliation of chi square from metrical and enumerative


aspects. Sankhya, 13:1–10.

Lee, A. (1990). U-Statistics. Marcel Dekker Inc., New York.

Lee, P. H. and Yu, P. L. H. (2012). Mixtures of weighted distance-based models for


ranking data with applications in political studies. Computational Statistics and Data
Analysis, 56:2486–2500.

Lehmann, E. (1975). Nonparametrics: Statistical Methods Based on Ranks. McGraw-


Hill, New York.

Lehmann, E. and Stein, C. (1949). On the theory of some non-parametric hypotheses.


Ann. Math. Statist., 20(1):28–45.

Liang, F., Liu, C., and Carroll, J. D. (2010). Advanced Markov Chain Monte Carlo
Methods. John Wiley & Sons.

Lindley, D. V. and Scott, W. F. (1995). New Cambridge Statistical Tables. Cambridge


University Press, 2nd edition.

Lindsay, B. G. and Qu, A. (2003). Inference functions and quadratic score tests. Statist.
Sci., 18(3):394–410.

Mallows, C. L. (1957). Non-null ranking models. I. Biometrika, 44:114–130.

Mantel, N. (1966). Evaluation of survival data and two new rank order statistics arising
in its consideration. Cancer Chemotherapy Reports, 50(3):163–170.

Marden, J. I. (1995). Analyzing and Modeling Rank Data. Chapman Hall, New York.


Matteson, D. S. and James, N. A. (2014). A nonparametric approach for multiple change


point analysis of multivariate data. Journal of the American Statistical Association,
109:334–345.

McCullagh, P. (1993). Models on spheres and models for permutations. In Fligner, M. A.


and Verducci, J. S., editors, Probability Models and Statistical Analyses for Ranking
Data, pages 278–283. Springer-Verlag.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E.
(1953). Equations of state calculations by fast computing machines. Journal of Chem-
ical Physics, 21:1087–1092.

Mielke, P. W., Jr. and Berry, K. J. (2001). Permutation Methods: A Distance Function Approach. Springer.

Neyman, J. (1937). Smooth test for goodness of fit. Skandinavisk Aktuarietidskrift,


20:149–199.

Neyman, J. and Pearson, E. (1933). On the problem of the most efficient tests of
statistical hypotheses. Philo. Trans. Roy. Soc. A, 231:289–337.

Nunez-Antonio, G. and Gutiérrez-Pena, E. (2005). A Bayesian analysis of directional


data using the von Mises–Fisher distribution. Communications in Statistics-Simulation
and Computation, 34(4):989–999.

Page, E. (1963). Ordered hypotheses for multiple treatments: a significance test for
linear ranks. Journal of the American Statistical Association, 58:216–230.

Pearson, K. (1900). On the criterion that a given system of deviations from the probable
in the case of a correlated system of variables is such that it can be reasonably supposed
to have arisen from random sampling. Philosophical Magazine, pages 157–175.

Prentice, R. (1978). Linear rank tests with right censored data. Biometrika, 65:167–179.

Qian, Z. and Yu, P. L. H. (2018). Weighted distance-based models for ranking data using the R package rankdist. Journal of Statistical Software, forthcoming.

Ralston, A. (1965). A First Course in Numerical Analysis. McGraw Hill, New York.

Randles, R. H. and Wolfe, D. A. (1979). Introduction to the Theory of Nonparametric Statistics. John Wiley and Sons, Inc.

Rayner, J. C. W., Best, D. J., and Thas, O. (2009a). Generalised smooth tests of
goodness of fit. Journal of Statistical Theory and Practice, pages 665–679.


Rayner, J. C. W., Thas, O., and Best, D. J. (2009b). Smooth Tests of Goodness of Fit
Using R. John Wiley and Sons, 2nd edition.

Robert, C. and Casella, G. (2004). Monte Carlo Statistical Methods. Springer, New
York, 2nd edition.

Rodriguez, G. (2005). Nonparametric Survival Models. Princeton University Press.

Royston, I. (1982). Expected normal order statistics (exact and approximate). Journal
of the Royal Statistical Society Series C, 31(2):161–165. Algorithm AS 177.

Schach, S. (1979). An alternative to the Friedman test with certain optimality properties.
Ann. Statist., 7(3):537–550.

Sen, P. (1968). Asymptotically efficient tests by the method of n rankings. Journal of the Royal Statistical Society Series B, 30:312–317.

Serfling, R. J. (2009). Approximation Theorems of Mathematical Statistics. John Wiley and Sons.

Siegel, S. and Tukey, J. W. (1960). A nonparametric sum of ranks procedure for relative
spread in unpaired samples. Journal of the American Statistical Association, 55:429–
445.

Siegmund, D. (1976). Importance sampling in the Monte Carlo study of sequential tests.
The Annals of Statistics, 4(4):673–684.

Snijders, A. M., Nowak, N., Segraves, R., Blackwood, S., Brown, N., Conroy, J., Hamil-
ton, G., Hindle, A. K., Huey, B., Kimura, K., et al. (2001). Assembly of microarrays
for genome-wide measurement of DNA copy number. Nature genetics, 29(3):263–264.

Sra, S. (2012). A short note on parameter approximation for von Mises-Fisher distributions: and a fast implementation of I_s(x). Computational Statistics, 27(1):177–190.

Stein, C. (1956). Efficient nonparametric testing and estimation. In Proceedings of


the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1:
Contributions to the Theory of Statistics, pages 187–195, Berkeley, Calif. University of
California Press.

Taghia, J., Ma, Z., and Leijon, A. (2014). Bayesian estimation of the von-Mises Fisher
mixture model with variational inference. IEEE transactions on pattern analysis and
machine intelligence, 36(9):1701–1715.


Terry, M. (1952). Some rank order tests which are most powerful against specific para-
metric alternatives. Ann. Math. Statist., 23(3):346–366.

Tierney, L. (1994). Markov chains for exploring posterior distributions (with discussion
and rejoinder). Annals of Statistics, 22(4):1701–1762.

van der Vaart, A. (2007). Asymptotic Statistics. Cambridge University Press.

Varin, C., Reid, N., and Firth, D. (2011). An overview of composite likelihood methods.
Statistica Sinica, 21:5–42.

Wald, A. (1949). Statistical decision functions. Ann. Math. Statist., 22(2):165–205.

Xu, H., Alvo, M., and Yu, P. L. H. (2018). Angle-based models for ranking data.
Computational Statistics and Data Analysis, 121:113–136.

Yu, P. L. H., Lam, K. F., and Alvo, M. (2002). Nonparametric rank test for independence
in opinion surveys. Austrian Journal of Statistics, 31:279–290.

Yu, P. L. H., Lam, K. F., and Lo, S. M. (2005). Factor analysis for ranked data with
application to a job selection attitude survey. Journal of the Royal Statistical Society
Series A, 168(3):583–597.

Yu, P. L. H. and Xu, H. (2018). Rank aggregation using latent-scale distance-based models. Statistics and Computing, forthcoming.

Index

A
Absolutely continuous, 32
Acceptance/rejection sampling, 39
Analysis of censored data, 245
Array CGH data, 220

B
Balanced incomplete block design, 157
Bayesian methods, 37
Bayesian models, 229
Beta distribution, 38
Borel-Cantelli lemma, 11

C
Categorical data, 76
Cayley, 80, 84, 86
Central moment, 7
Change of measure distributions, 71
Change-point problem, 209
Chebyshev inequality, 12
Chebyshev polynomials, 67, 72
Combinatorial central limit theorem, 55
Compatibility, 152
Complete block designs, 175
Composite likelihood, 35
Concordance, 139
Conditional density, 6
Conjugate prior, 38, 233
Consistency, 18
Contiguity, 31
Contingency tables, 87
Convergence almost surely, 12
Convergence in distribution, 12
Convergence in probability, 12
Cramér-Rao inequality, 19
Cramér-Wold, 16
Cumulant, 70
Cumulative distribution function, 5
Cyclic designs, 158
Cyclic structure models, 86

D
Delta method, 14
Design problems, 152
Distance between sets of ranks, 118
Distance function, 80
Distance-based models, 79
Durbin test, 158

E
Efficiency, 187
Exponential family, 19
Exponential tilting, 19
Extreme value density, 173

F
Fisher information matrix, 20
Fisher-Yates normal scores, 172
Fisher-Yates test, 166, 167

G
Gamma density, 43
Gene expression data, 131
General block designs, 154
Gibbs inequality, 64
Gini's statistic, 52
Godambe information matrix, 36
Goldberg data, 260
Goodness of fit tests, 63
Group divisible design, 158

H
Hamming, 80
Hamming scores, 144
Hazard function, 246
Hermite polynomials, 67
Hodges-Lehmann estimate, 110
Hoeffding's formula, 164
Hypothesis testing, 23

I
Importance sampling, 39, 69
Incomplete block designs, 176
Information design matrix, 157
Interest rate data, 223

K
Kaplan-Meier estimator, 246
Kendall, 80
Kendall correlation, 49
Kendall scores, 142
Kendall statistic, 52
Kullback-Leibler information number, 64

L
Laguerre polynomials, 67
Le Cam's first lemma, 32
Le Cam's second lemma, 33
Le Cam's third lemma, 33
Leisure time data, 259
Likelihood function, 17
Lindeberg-Feller central limit theorem, 15
Linear rank statistics, 45
Local asymptotic normality, 34
Locally most powerful rank tests, 164
Locally most powerful test, 26
Location alternatives, 165
Logistic density, 172
Lyapunov condition, 15

M
Mann-Whitney test, 105
Markov Chain Monte Carlo, 40
Maximum likelihood estimator, 21
Median, 94
Median test, 193
Method of n-rankings, 174
Moment generating function, 7
Monotone likelihood ratio, 25
Multinomial distribution, 22
Multiple change-point problems, 209
Multi-sample problems, 117
Multivariate central limit theorem, 16
Multivariate normal distribution, 16

N
Neyman smooth tests, 66
Neyman-Pearson lemma, 23
Noether condition, 50

O
Optimal rank tests, 163
Order statistics, 7

P
Pearson goodness of fit statistic, 78
Penalized likelihood, 147
Permutation, 79
Permutation test, 101
φ-Component models, 84
Pitman efficiency, 187
Poisson distribution, 29
Poisson-Charlier, 67
Power function, 24, 96
Probability density function, 6
Projection theorem, 49

Q
Quadratic form, 17

R
Radon-Nikodym derivative, 32
Random sample, 7
Range, 42
Regression coefficients, 46
Regression problems, 170
Right invariance, 80

S
Sample space, 5
Savage scores, 169
Score function, 27
Score test, 28
Siegel-Tukey test, 112
Sign test, 91
Similarity function, 82
Size of the test, 23
Slope of the test, 188
Slutsky's theorem, 13
Song data, 259
Spearman, 80
Spearman correlation, 48
Spearman footrule, 58, 80
Spearman scores, 141
Square integrable functions, 49
Stochastically larger, 101
Sufficiency, 18
Survival analysis, 245
Symbolic data analysis, 8

T
Tests for interaction, 131
Tests for trend, 137
The three Amigos, 28
Two sample problems, 100
Two sample ranking problem, 145
Type I error, 23, 96
Type II error, 23

U
U statistics, 51
Umbrella alternative data, 259
Umbrella alternatives, 124
Unbiased, 18
Unified theory of hypothesis testing, 117
Uniform integrability, 33

V
Van der Waerden test, 167
Variational inference, 41, 234

W
Walsh sums, 98
Weak convergence, 12
Weak law of large numbers, 12
Wilcoxon rank sum test, 105
Wilcoxon scores, 180
Wilcoxon signed rank test, 97
Wilcoxon test, 166
