Experiments with a New Boosting Algorithm

Yoav Freund        Robert E. Schapire
Abstract. In an earlier paper, we introduced a new "boosting" algorithm called AdaBoost which, theoretically, can be used to significantly reduce the error of any learning algorithm that consistently generates classifiers whose performance is a little better than random guessing. We also introduced the related notion of a "pseudo-loss" which is a method for forcing a learning algorithm of multi-label concepts to concentrate on the labels that are hardest to discriminate. In this paper, we describe experiments we carried out to assess how well AdaBoost, with and without pseudo-loss, performs on real learning problems.

We performed two sets of experiments. The first set compared boosting to Breiman's "bagging" method when used to aggregate various classifiers (including decision trees and single attribute-value tests). We compared the performance of the two methods on a collection of machine-learning benchmarks. In the second set of experiments, we studied in more detail the performance of boosting using a nearest-neighbor classifier on an OCR problem.

1 INTRODUCTION

"Boosting" is a general method for improving the performance of any learning algorithm. In theory, boosting can be used to significantly reduce the error of any "weak" learning algorithm that consistently generates classifiers which need only be a little bit better than random guessing. Despite the potential benefits of boosting promised by the theoretical results, the true practical value of boosting can only be assessed by testing the method on real machine learning problems. In this paper, we present such an experimental assessment of a new boosting algorithm called AdaBoost.

Boosting works by repeatedly running a given weak¹ learning algorithm on various distributions over the training data, and then combining the classifiers produced by the weak learner into a single composite classifier. The first provably effective boosting algorithms were presented by Schapire [20] and Freund [9]. More recently, we described and analyzed AdaBoost, and we argued that this new boosting algorithm has certain properties which make it more practical and easier to implement than its predecessors [10]. This algorithm, which we used in all our experiments, is described in detail in Section 2.

This paper describes two distinct sets of experiments. In the first set of experiments, described in Section 3, we compared boosting to "bagging," a method described by Breiman [1] which works in the same general fashion (i.e., by repeatedly rerunning a given weak learning algorithm, and combining the computed classifiers), but which constructs each distribution in a simpler manner. (Details given below.) We compared boosting with bagging because both methods work by combining many classifiers. This comparison allows us to separate out the effect of modifying the distribution on each round (which is done differently by each algorithm) from the effect of voting multiple classifiers (which is done the same by each).

In our experiments, we compared boosting to bagging using a number of different weak learning algorithms of varying levels of sophistication. These include: (1) an algorithm that searches for very simple prediction rules which test on a single attribute (similar to Holte's very simple classification rules [14]); (2) an algorithm that searches for a single good decision rule that tests on a conjunction of attribute tests (similar in flavor to the rule-formation part of Cohen's RIPPER algorithm [3] and Fürnkranz and Widmer's IREP algorithm [11]); and (3) Quinlan's C4.5 decision-tree algorithm [18]. We tested these algorithms on a collection of 27 benchmark learning problems taken from the UCI repository.

The main conclusion of our experiments is that boosting performs significantly and uniformly better than bagging when the weak learning algorithm generates fairly simple classifiers (algorithms (1) and (2) above). When combined with C4.5, boosting still seems to outperform bagging slightly, but the results are less compelling.

We also found that boosting can be used with very simple rules (algorithm (1)) to construct classifiers that are quite good relative, say, to C4.5. Kearns and Mansour [16] argue that C4.5 can itself be viewed as a kind of boosting algorithm, so a comparison of AdaBoost and C4.5 can be seen as a comparison of two competing boosting algorithms. See Dietterich, Kearns and Mansour's paper [4] for more detail on this point.

In the second set of experiments, we test the performance of boosting on a nearest neighbor classifier for handwritten digit recognition. In this case the weak learning algorithm is very simple, and this lets us gain some insight into the interaction between the boosting algorithm and the nearest neighbor classifier.

Home page: "http://www.research.att.com/orgs/ssr/people/uid". Expected to change to "http://www.research.att.com/~uid" sometime in the near future (for uid ∈ {yoav, schapire}).

¹We use the term "weak" learning algorithm, even though, in practice, boosting might be combined with a quite strong learning algorithm such as C4.5.
We show that the boosting algorithm is an effective way for finding a small subset of prototypes that performs almost as well as the complete set. We also show that it compares favorably to the standard method of Condensed Nearest Neighbor [13] in terms of its test error.

There seem to be two separate reasons for the improvement in performance that is achieved by boosting. The first and better understood effect of boosting is that it generates a hypothesis whose error on the training set is small by combining many hypotheses whose error may be large (but still better than random guessing). It seems that boosting may be helpful on learning problems having either of the following two properties. The first property, which holds for many real-world problems, is that the observed examples tend to have varying degrees of hardness. For such problems, the boosting algorithm tends to generate distributions that concentrate on the harder examples, thus challenging the weak learning algorithm to perform well on these harder parts of the sample space. The second property is that the learning algorithm be sensitive to changes in the training examples so that significantly different hypotheses are generated for different training sets. In this sense, boosting is similar to Breiman's bagging [1] which performs best when the weak learner exhibits such "unstable" behavior. However, unlike bagging, boosting tries actively to force the weak learning algorithm to change its hypotheses by changing the distribution over the training examples as a function of the errors made by previously generated hypotheses.

The second effect of boosting has to do with variance reduction. Intuitively, taking a weighted majority over many hypotheses, all of which were trained on different samples taken out of the same training set, has the effect of reducing the random variability of the combined hypothesis. Thus, like bagging, boosting may have the effect of producing a combined hypothesis whose variance is significantly lower than those produced by the weak learner. However, unlike bagging, boosting may also reduce the bias of the learning algorithm, as discussed above. (See Kong and Dietterich [17] for further discussion of the bias and variance reducing effects of voting multiple hypotheses, as well as Breiman's [2] very recent work comparing boosting and bagging in terms of their effects on bias and variance.) In our first set of experiments, we compare boosting and bagging, and try to use that comparison to separate between the bias and variance reducing effects of boosting.

Previous work. Drucker, Schapire and Simard [8, 7] performed the first experiments using a boosting algorithm. They used Schapire's [20] original boosting algorithm combined with a neural net for an OCR problem. Follow-up comparisons to other ensemble methods were done by Drucker et al. [6]. More recently, Drucker and Cortes [5] used AdaBoost with a decision-tree algorithm for an OCR task. Jackson and Craven [15] used AdaBoost to learn classifiers represented by sparse perceptrons, and tested the algorithm on a set of benchmarks. Finally, Quinlan [19] recently conducted an independent comparison of boosting and bagging combined with C4.5 on a collection of UCI benchmarks.

2 THE BOOSTING ALGORITHM

In this section, we describe our boosting algorithm, called AdaBoost. See our earlier paper [10] for more details about the algorithm and its theoretical properties.

We describe two versions of the algorithm which we denote AdaBoost.M1 and AdaBoost.M2. The two versions are equivalent for binary classification problems and differ only in their handling of problems with more than two classes.

2.1 ADABOOST.M1

We begin with the simpler version, AdaBoost.M1. The boosting algorithm takes as input a training set of $m$ examples $S = \langle (x_1, y_1), \ldots, (x_m, y_m) \rangle$ where $x_i$ is an instance drawn from some space $X$ and represented in some manner (typically, a vector of attribute values), and $y_i \in Y$ is the class label associated with $x_i$. In this paper, we always assume that the set of possible labels $Y$ is of finite cardinality $k$.

In addition, the boosting algorithm has access to another unspecified learning algorithm, called the weak learning algorithm, which is denoted generically as WeakLearn. The boosting algorithm calls WeakLearn repeatedly in a series of rounds. On round $t$, the booster provides WeakLearn with a distribution $D_t$ over the training set $S$. In response, WeakLearn computes a classifier or hypothesis $h_t : X \to Y$ which should correctly classify a fraction of the training set that has large probability with respect to $D_t$. That is, the weak learner's goal is to find a hypothesis $h_t$ which minimizes the (training) error $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$. Note that this error is measured with respect to the distribution $D_t$ that was provided to the weak learner. This process continues for $T$ rounds, and, at last, the booster combines the weak hypotheses $h_1, \ldots, h_T$ into a single final hypothesis $h_{\mathit{fin}}$.

Algorithm AdaBoost.M1
Input: sequence of $m$ examples $\langle (x_1, y_1), \ldots, (x_m, y_m) \rangle$ with labels $y_i \in Y = \{1, \ldots, k\}$;
  weak learning algorithm WeakLearn;
  integer $T$ specifying number of iterations.
Initialize $D_1(i) = 1/m$ for all $i$.
Do for $t = 1, 2, \ldots, T$:
  1. Call WeakLearn, providing it with the distribution $D_t$.
  2. Get back a hypothesis $h_t : X \to Y$.
  3. Calculate the error of $h_t$: $\epsilon_t = \sum_{i : h_t(x_i) \neq y_i} D_t(i)$. If $\epsilon_t > 1/2$, then set $T = t - 1$ and abort loop.
  4. Set $\beta_t = \epsilon_t / (1 - \epsilon_t)$.
  5. Update distribution $D_t$: $D_{t+1}(i) = \dfrac{D_t(i)}{Z_t} \times \begin{cases} \beta_t & \text{if } h_t(x_i) = y_i \\ 1 & \text{otherwise} \end{cases}$
     where $Z_t$ is a normalization constant (chosen so that $D_{t+1}$ will be a distribution).
Output the final hypothesis: $h_{\mathit{fin}}(x) = \arg\max_{y \in Y} \sum_{t : h_t(x) = y} \log(1/\beta_t)$.

Figure 1: The algorithm AdaBoost.M1.
Still unspecified are: (1) the manner in which $D_t$ is computed on each round, and (2) how $h_{\mathit{fin}}$ is computed. Different boosting schemes answer these two questions in different ways. AdaBoost.M1 uses the simple rule shown in Figure 1. The initial distribution $D_1$ is uniform over $S$ so $D_1(i) = 1/m$ for all $i$. To compute distribution $D_{t+1}$ from $D_t$ and the last weak hypothesis $h_t$, we multiply the weight of example $i$ by some number $\beta_t \in [0, 1)$ if $h_t$ classifies $x_i$ correctly, and otherwise the weight is left unchanged. The weights are then renormalized by dividing by the normalization constant $Z_t$. Effectively, "easy" examples that are correctly classified by many of the previous weak hypotheses get lower weight, and "hard" examples which tend often to be misclassified get higher weight. Thus, AdaBoost focuses the most weight on the examples which seem to be hardest for WeakLearn.

The number $\beta_t$ is computed as shown in the figure as a function of $\epsilon_t$. The final hypothesis $h_{\mathit{fin}}$ is a weighted vote (i.e., a weighted linear threshold) of the weak hypotheses. That is, for a given instance $x$, $h_{\mathit{fin}}$ outputs the label $y$ that maximizes the sum of the weights of the weak hypotheses predicting that label. The weight of hypothesis $h_t$ is defined to be $\log(1/\beta_t)$ so that greater weight is given to hypotheses with lower error.

The important theoretical property about AdaBoost.M1 is stated in the following theorem. This theorem shows that if the weak hypotheses consistently have error only slightly better than 1/2, then the training error of the final hypothesis $h_{\mathit{fin}}$ drops to zero exponentially fast. For binary classification problems, this means that the weak hypotheses need be only slightly better than random.

Theorem 1 ([10]) Suppose the weak learning algorithm WeakLearn, when called by AdaBoost.M1, generates hypotheses with errors $\epsilon_1, \ldots, \epsilon_T$, where $\epsilon_t$ is as defined in Figure 1. Assume each $\epsilon_t \le 1/2$, and let $\gamma_t = 1/2 - \epsilon_t$. Then the following upper bound holds on the error of the final hypothesis $h_{\mathit{fin}}$:
$$\frac{\bigl|\{i : h_{\mathit{fin}}(x_i) \neq y_i\}\bigr|}{m} \;\le\; \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2} \;\le\; \exp\Bigl(-2 \sum_{t=1}^{T} \gamma_t^2\Bigr).$$

Theorem 1 implies that the training error of the final hypothesis generated by AdaBoost.M1 is small. This does not necessarily imply that the test error is small. However, if the weak hypotheses are "simple" and $T$ "not too large," then the difference between the training and test errors can also be theoretically bounded (see our earlier paper [10] for more on this subject).

The experiments in this paper indicate that the theoretical bound on the training error is often weak, but generally correct qualitatively. However, the test error tends to be much better than the theory would suggest, indicating a clear defect in our theoretical understanding.

The main disadvantage of AdaBoost.M1 is that it is unable to handle weak hypotheses with error greater than 1/2. The expected error of a hypothesis which randomly guesses the label is $1 - 1/k$, where $k$ is the number of possible labels. Thus, for $k = 2$, the weak hypotheses need to be just slightly better than random guessing, but when $k > 2$, the requirement that the error be less than 1/2 is quite strong and may often be hard to meet.
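As a quick numerical illustration of the bound in Theorem 1 (the numbers below are ours, not the paper's): if every weak hypothesis achieves error at most 0.4, so that $\gamma_t \ge 0.1$ on every round, then

```latex
% illustrative numbers only: \epsilon_t \le 0.4, hence \gamma_t \ge 0.1 for all t
\[
  \frac{\bigl|\{i : h_{\mathit{fin}}(x_i) \neq y_i\}\bigr|}{m}
  \;\le\; \exp\Bigl(-2\sum_{t=1}^{T}\gamma_t^2\Bigr)
  \;\le\; e^{-2T(0.1)^2} \;=\; e^{-0.02\,T},
\]
```

so roughly $T \approx 150$ rounds already drive the bound below 5% ($e^{-3} \approx 0.0498$), even though each weak hypothesis on its own is only slightly better than random guessing on a binary problem.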
2.2 ADABOOST.M2

The second version of AdaBoost attempts to overcome this difficulty by extending the communication between the boosting algorithm and the weak learner. First, we allow the weak learner to generate more expressive hypotheses, which, rather than identifying a single label in $Y$, instead choose a set of "plausible" labels. This may often be easier than choosing just one label. For instance, in an OCR setting, it may be hard to tell if a particular image is "7" or a "9", but easy to eliminate all of the other possibilities. In this case, rather than choosing between 7 and 9, the hypothesis may output the set $\{7, 9\}$ indicating that both labels are plausible.

We also allow the weak learner to indicate a "degree of plausibility." Thus, each weak hypothesis outputs a vector in $[0, 1]^k$, where the components with values close to 1 or 0 correspond to those labels considered to be plausible or implausible, respectively. Note that this vector of values is not a probability vector, i.e., the components need not sum to one.²

While we give the weak learning algorithm more expressive power, we also place a more complex requirement on the performance of the weak hypotheses. Rather than using the usual prediction error, we ask that the weak hypotheses do well with respect to a more sophisticated error measure that we call the pseudo-loss. Unlike ordinary error which is computed with respect to a distribution over examples, pseudo-loss is computed with respect to a distribution over the set of all pairs of examples and incorrect labels.

²We deliberately use the term "plausible" rather than "probable" to emphasize the fact that these numbers should not be interpreted as the probability of a given label.

Algorithm AdaBoost.M2
Input: sequence of $m$ examples $\langle (x_1, y_1), \ldots, (x_m, y_m) \rangle$ with labels $y_i \in Y = \{1, \ldots, k\}$;
  weak learning algorithm WeakLearn;
  integer $T$ specifying number of iterations.
Let $B = \{(i, y) : i \in \{1, \ldots, m\},\, y \neq y_i\}$.
Initialize $D_1(i, y) = 1/|B|$ for $(i, y) \in B$.
Do for $t = 1, 2, \ldots, T$:
  1. Call WeakLearn, providing it with mislabel distribution $D_t$.
  2. Get back a hypothesis $h_t : X \times Y \to [0, 1]$.
  3. Calculate the pseudo-loss of $h_t$: $\epsilon_t = \frac{1}{2} \sum_{(i, y) \in B} D_t(i, y)\bigl(1 - h_t(x_i, y_i) + h_t(x_i, y)\bigr)$.
  4. Set $\beta_t = \epsilon_t / (1 - \epsilon_t)$.
  5. Update $D_t$: $D_{t+1}(i, y) = \dfrac{D_t(i, y)}{Z_t} \cdot \beta_t^{(1/2)(1 + h_t(x_i, y_i) - h_t(x_i, y))}$
     where $Z_t$ is a normalization constant (chosen so that $D_{t+1}$ will be a distribution).
Output the final hypothesis: $h_{\mathit{fin}}(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} \Bigl(\log \frac{1}{\beta_t}\Bigr) h_t(x, y)$.

Figure 2: The algorithm AdaBoost.M2.
By manipulating this distribution, the boosting algorithm can focus the weak learner not only on hard-to-classify examples, but more specifically, on the incorrect labels that are hardest to discriminate. We will see that the boosting algorithm AdaBoost.M2, which is based on these ideas, achieves boosting if each weak hypothesis has pseudo-loss slightly better than random guessing.

More formally, a mislabel is a pair $(i, y)$ where $i$ is the index of a training example and $y$ is an incorrect label associated with example $i$. Let $B$ be the set of all mislabels: $B = \{(i, y) : i \in \{1, \ldots, m\},\, y \neq y_i\}$. A mislabel distribution is a distribution defined over the set $B$ of all mislabels.

On each round $t$ of boosting, AdaBoost.M2 (Figure 2) supplies the weak learner with a mislabel distribution $D_t$ and receives in return a hypothesis $h_t : X \times Y \to [0, 1]$; the pseudo-loss $\epsilon_t$ of $h_t$ and the updated distribution $D_{t+1}$ are then computed as shown in the figure. For boosting to succeed, we require only that the weak hypotheses have pseudo-loss less than 1/2, i.e., only slightly better than a trivial (constant-valued) hypothesis, regardless of the number of classes. Also, although the weak hypotheses $h_t$ are evaluated with respect to the pseudo-loss, we of course evaluate the final hypothesis $h_{\mathit{fin}}$ using the ordinary error measure.

Theorem 2 ([10]) Suppose the weak learning algorithm WeakLearn, when called by AdaBoost.M2, generates hypotheses with pseudo-losses $\epsilon_1, \ldots, \epsilon_T$, where $\epsilon_t$ is as defined in Figure 2. Let $\gamma_t = 1/2 - \epsilon_t$. Then the following upper bound holds on the error of the final hypothesis $h_{\mathit{fin}}$:
$$\frac{\bigl|\{i : h_{\mathit{fin}}(x_i) \neq y_i\}\bigr|}{m} \;\le\; (k - 1) \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2}.$$
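To make the pseudo-loss bookkeeping of Figure 2 concrete, here is a small Python sketch. It assumes, purely for illustration, that a weak hypothesis is a callable `h(x, label)` returning a plausibility in [0, 1], and that the mislabel distribution is a dictionary keyed by (example index, incorrect label); the outer boosting loop is analogous to the AdaBoost.M1 sketch above.

```python
import math
from collections import defaultdict

def initial_mislabel_distribution(y, labels):
    # D_1(i, y') = 1/|B| over all mislabels B = {(i, y') : y' != y_i}
    B = [(i, lab) for i in range(len(y)) for lab in labels if lab != y[i]]
    return {pair: 1.0 / len(B) for pair in B}

def pseudo_loss(D, h, X, y):
    # eps_t = 1/2 * sum_{(i,y') in B} D(i,y') * (1 - h(x_i, y_i) + h(x_i, y'))
    return 0.5 * sum(w * (1.0 - h(X[i], y[i]) + h(X[i], wrong))
                     for (i, wrong), w in D.items())

def update_mislabel_distribution(D, h, X, y, beta):
    # D_{t+1}(i,y') proportional to D_t(i,y') * beta ** ((1 + h(x_i,y_i) - h(x_i,y')) / 2)
    new_D = {(i, wrong): w * beta ** (0.5 * (1.0 + h(X[i], y[i]) - h(X[i], wrong)))
             for (i, wrong), w in D.items()}
    Z = sum(new_D.values())
    return {key: w / Z for key, w in new_D.items()}

def final_hypothesis(hypotheses, betas, labels):
    # h_fin(x) = argmax_y sum_t log(1/beta_t) * h_t(x, y)
    def h_fin(x):
        score = defaultdict(float)
        for h, beta in zip(hypotheses, betas):
            for lab in labels:
                score[lab] += math.log(1.0 / beta) * h(x, lab)
        return max(score, key=score.get)
    return h_fin
```

The exponent in the update lies between 0 and 1 and is largest exactly when $h_t$ already separates the correct label $y_i$ from the incorrect label $y$, so well-handled mislabels lose the most weight; this is what pushes the weak learner toward the labels that are hardest to discriminate.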
3 BOOSTING AND BAGGING

…

name              #examples  #test  #classes  #discrete attr.  #continuous attr.
soybean-small           47      -         4        35          -
labor                   57      -         2         8          8
promoters              106      -         2        57          -
iris                   150      -         3         -          4
hepatitis              155      -         2        13          6
sonar                  208      -         2         -         60
audiology.stand        226      -        24        69          -
cleve                  303      -         2         7          6
soybean-large          307    376        19        35          -
ionosphere             351      -         2         -         34
house-votes-84         435      -         2        16          -
votes1                 435      -         2        15          -
crx                    690      -         2         9          6
breast-cancer-w        699      -         2         -          9
pima-indians-di        768      -         2         -          8
vehicle                846      -         4         -         18
vowel                  528    462        11         -         10
german                1000      -         2        13          7
segmentation          2310      -         7         -         19
hypothyroid           3163      -         2        18          7
sick-euthyroid        3163      -         2        18          7
splice                3190      -         3        60          -
kr-vs-kp              3196      -         2        36          -
satimage              4435   2000         6         -         36
agaricus-lepiot       8124      -         2        22          -
letter-recognit      16000   4000        26         -         16

Table 1: The benchmark machine learning problems used in the experiments.

Figure 3: Comparison of using pseudo-loss versus ordinary error on multi-class problems for boosting and bagging. (Panels: boosting FindAttrTest, boosting FindDecRule, bagging FindAttrTest, bagging FindDecRule; both axes show error.)

3.1 THE WEAK LEARNING ALGORITHMS

The first of the weak learning algorithms we used, which we call FindAttrTest, searches for the single attribute-value test with minimum error (or pseudo-loss) on the training set. More precisely, FindAttrTest computes a classifier which is defined by an attribute $a$, a value $v$ and three predictions $p_0$, $p_1$ and $p_?$. This classifier classifies a new example $x$ as follows: if the value of attribute $a$ is missing on $x$, then predict $p_?$; if attribute $a$ is discrete and its value on example $x$ is equal to $v$, or if attribute $a$ is continuous and its value on $x$ is at most $v$, then predict $p_0$; otherwise predict $p_1$. If using ordinary error (AdaBoost.M1), these "predictions" $p_0$, $p_1$, $p_?$ would be simple classifications; for pseudo-loss, the "predictions" would be vectors in $[0, 1]^k$ (where $k$ is the number of classes).

The algorithm FindAttrTest searches exhaustively for the classifier of the form given above with minimum error or pseudo-loss with respect to the distribution provided by the booster. In other words, all possible values of $a$, $v$, $p_0$, $p_1$ and $p_?$ are considered. With some preprocessing, this search can be carried out for the error-based implementation in $O(nm)$ time, where $n$ is the number of attributes and $m$ the number of examples. As is typical, the pseudo-loss implementation adds a factor of $O(k)$, where $k$ is the number of class labels. For this algorithm, we used boosting with reweighting.

The second weak learner does a somewhat more sophisticated search for a decision rule that tests on a conjunction of attribute-value tests. We sketch the main ideas of this algorithm, which we call FindDecRule, but omit some of the finer details for lack of space. These details will be provided in the full paper.

First, the algorithm requires an unweighted training set, so we use the resampling version of boosting. The given training set is randomly divided into a growing set using 70% of the data, and a pruning set with the remaining 30%. In the first phase, the growing set is used to grow a list of attribute-value tests. Each test compares an attribute $a$ to a value $v$, similar to the tests used by FindAttrTest. We use an entropy-based potential function to guide the growth of the list of tests. The list is initially empty, and one test is added at a time, each time choosing the test that will cause the greatest drop in potential. After the test is chosen, only one branch is expanded, namely, the branch with the highest remaining potential. The list continues to be grown in this fashion until no test remains which will further reduce the potential.

In the second phase, the list is pruned by selecting the prefix of the list with minimum error (or pseudo-loss) on the pruning set.

The third weak learner is Quinlan's C4.5 decision-tree algorithm [18]. We used all the default options with pruning turned on. Since C4.5 expects an unweighted training sample, we used resampling. Also, we did not attempt to use AdaBoost.M2 since C4.5 is designed to minimize error, not pseudo-loss. Furthermore, we did not expect pseudo-loss to be helpful when using a weak learning algorithm as strong as C4.5, since such an algorithm will usually be able to find a hypothesis with error less than 1/2.
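As a concrete (and heavily simplified) illustration of the first of these weak learners, the sketch below searches over single attribute-value tests for discrete attributes and ordinary error only. The continuous-threshold case, the missing-value prediction $p_?$, the pseudo-loss variant, and the preprocessing that achieves the $O(nm)$ running time are all omitted, and every name in the code is ours rather than the paper's.

```python
import numpy as np

def weighted_majority(y, D, labels):
    # label carrying the largest total weight under the distribution D
    return max(labels, key=lambda lab: D[y == lab].sum())

def find_attr_test(X, y, D, labels):
    """Sketch of a FindAttrTest-style weak learner (discrete attributes, ordinary error).

    X, y, D are NumPy arrays; D is the distribution supplied by the booster.
    Returns (attr, value, p0, p1): predict p0 where X[:, attr] == value, else p1.
    """
    best, best_err = None, float("inf")
    for a in range(X.shape[1]):
        for v in np.unique(X[:, a]):
            match = X[:, a] == v
            # on each branch, predict the label with the most weight under D
            p0 = weighted_majority(y[match], D[match], labels)
            p1 = weighted_majority(y[~match], D[~match], labels)
            pred = np.where(match, p0, p1)
            err = D[pred != y].sum()          # D-weighted training error of this test
            if err < best_err:
                best, best_err = (a, v, p0, p1), err
    return best
```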
3.2 BAGGING

We compared boosting to Breiman's [1] "bootstrap aggregating" or "bagging" method for training and combining multiple copies of a learning algorithm. Briefly, the method works by training each copy of the algorithm on a bootstrap sample, i.e., a sample of size $m$ chosen uniformly at random with replacement from the original training set $S$ (of size $m$). The multiple hypotheses that are computed are then combined using simple voting; that is, the final composite hypothesis classifies an example $x$ to the class most often assigned by the underlying "weak" hypotheses. See his paper for more details. The method can be quite effective, especially, according to Breiman, for "unstable" learning algorithms for which a small change in the data effects a large change in the computed hypothesis.

In order to compare AdaBoost.M2, which uses pseudo-loss, to bagging, we also extended bagging in a natural way for use with a weak learning algorithm that minimizes pseudo-loss rather than ordinary error. As described in Section 2.2, such a weak learning algorithm expects to be provided with a distribution over the set of all mislabels. On each round of bagging, we construct this distribution using the bootstrap method; that is, we select $|B|$ mislabels from $B$ (chosen uniformly at random with replacement), and assign each mislabel weight $1/|B|$ times the number of times it was chosen. The hypotheses $h_t$ computed in this manner are then combined using voting in a natural manner.
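For comparison, here is a minimal sketch of the bagging procedure just described; the mislabel-bootstrap extension for pseudo-loss is noted in the closing comment. The `learn(X, y)` interface (returning an object with `.predict`) and the NumPy inputs are assumptions made for this illustration.

```python
import numpy as np
from collections import Counter

def bagging(X, y, learn, T, seed=0):
    """Train T copies of `learn` on bootstrap samples and combine them by simple voting."""
    rng = np.random.default_rng(seed)
    m = len(y)
    hypotheses = []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)        # bootstrap sample: m draws with replacement
        hypotheses.append(learn(X[idx], y[idx]))

    def predict(x):
        votes = Counter(h.predict(np.asarray([x]))[0] for h in hypotheses)
        return votes.most_common(1)[0][0]       # class most often assigned by the copies
    return predict

# Pseudo-loss version (sketch): on each round, instead draw |B| mislabels (i, y') from B
# with replacement and give each chosen mislabel weight count/|B|; the resulting bootstrap
# mislabel distribution is handed to the pseudo-loss weak learner, and the hypotheses are
# again combined by voting.
```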
Figure 4: Comparison of boosting and bagging for each of the weak learners. (Three scatter plots — FindAttrTest, FindDecRule, C4.5 — with boosting error on one axis and bagging error on the other.)

Figure 5: Comparison of C4.5 versus various other boosting and bagging methods. (Four scatter plots — boosting FindAttrTest, boosting FindDecRule, boosting C4.5, bagging C4.5 — with the error of the given method on one axis and the error of C4.5 on the other.)

… for each benchmark. These experiments indicate that boosting using pseudo-loss clearly outperforms boosting using error. Using pseudo-loss did dramatically better than error on every non-binary problem (except it did slightly worse …)
Table 2: Error rates (in percent) on the benchmark problems. For each weak learner (FindAttrTest, FindDecRule, and C4.5), the column labeled "–" gives the result of running the weak learning algorithm by itself, and the "boost" and "bag" columns give the results of boosting and bagging that algorithm. For FindAttrTest and FindDecRule, results are reported using both ordinary error and pseudo-loss; the pseudo-loss columns are blank for two-class problems.

                 FindAttrTest                       FindDecRule                        C4.5
             error         pseudo-loss          error         pseudo-loss
name       –  boost  bag    boost  bag       –  boost  bag    boost  bag         –  boost  bag
soybean-small 57.6 56.4 48.7 0.2 20.5 51.8 56.0 45.7 0.4 2.9 2.2 3.4 2.2
labor 25.1 8.8 19.1 24.0 7.3 14.6 15.8 13.1 11.3
promoters 29.7 8.9 16.6 25.9 8.3 13.7 22.0 5.0 12.7
iris 35.2 4.7 28.4 4.8 7.1 38.3 4.3 18.8 4.8 5.5 5.9 5.0 5.0
hepatitis 19.7 18.6 16.8 21.6 18.0 20.1 21.2 16.3 17.5
sonar 25.9 16.5 25.9 31.4 16.2 26.1 28.9 19.0 24.3
glass 51.5 51.1 50.9 29.4 54.2 49.7 48.5 47.2 25.0 52.0 31.7 22.7 25.7
audiology.stand 53.5 53.5 53.5 23.6 65.7 53.5 53.5 53.5 19.9 65.7 23.1 16.2 20.1
cleve 27.8 18.8 22.4 27.4 19.7 20.3 26.6 21.7 20.9
soybean-large 64.8 64.5 59.0 9.8 74.2 73.6 73.6 73.6 7.2 66.0 13.3 6.8 12.2
ionosphere 17.8 8.5 17.3 10.3 6.6 9.3 8.9 5.8 6.2
house-votes-84 4.4 3.7 4.4 5.0 4.4 4.4 3.5 5.1 3.6
votes1 12.7 8.9 12.7 13.2 9.4 11.2 10.3 10.4 9.2
crx 14.5 14.4 14.5 14.5 13.5 14.5 15.8 13.8 13.6
breast-cancer-w 8.4 4.4 6.7 8.1 4.1 5.3 5.0 3.3 3.2
pima-indians-di 26.1 24.4 26.1 27.8 25.3 26.4 28.4 25.7 24.4
vehicle 64.3 64.4 57.6 26.1 56.1 61.3 61.2 61.0 25.0 54.3 29.9 22.6 26.1
vowel 81.8 81.8 76.8 18.2 74.7 82.0 72.7 71.6 6.5 63.2 2.2 0.0 0.0
german 30.0 24.9 30.4 30.0 25.4 29.6 29.4 25.0 24.6
segmentation 75.8 75.8 54.5 4.2 72.5 73.7 53.3 54.3 2.4 58.0 3.6 1.4 2.7
hypothyroid 2.2 1.0 2.2 0.8 1.0 0.7 0.8 1.0 0.8
sick-euthyroid 5.6 3.0 5.6 2.4 2.4 2.2 2.2 2.1 2.1
splice 37.0 9.2 35.6 4.4 33.4 29.5 8.0 29.5 4.0 29.5 5.8 4.9 5.2
kr-vs-kp 32.8 4.4 30.7 24.6 0.7 20.8 0.5 0.3 0.6
satimage 58.3 58.3 58.3 14.9 41.6 57.6 56.5 56.7 13.1 30.0 14.8 8.9 10.6
agaricus-lepiot 11.3 0.0 11.3 8.2 0.0 8.2 0.0 0.0 0.0
letter-recognit 92.9 92.9 91.9 34.1 93.7 92.3 91.8 91.8 30.4 93.7 13.8 3.3 6.8
4 BOOSTING A NEAREST-NEIGHBOR CLASSIFIER

In this set of experiments we applied boosting to a nearest neighbor classifier on the problem of recognizing handwritten digits. Our goal is not to improve on the accuracy of the nearest neighbor classifier, but rather to speed it up. Speed-up is achieved by reducing the number of prototypes in the hypothesis (and …). The standard approach, the condensed nearest neighbor rule (CNN) [13], searches for the minimal set of prototypes that is sufficient to label all the training set correctly.

The dataset comes from the US Postal Service (USPS) and consists of 9709 training examples and 2007 test examples. The training and test examples are evidently drawn from rather different distributions, as there is a very significant improvement in the performance if the partition of the data into training and testing is done at random (rather than using the given partition). We report results both on the original partitioning and on a training set and a test set of the same sizes that were generated by randomly partitioning the union of the original training and test sets.

Each image is represented by a 16×16 matrix of 8-bit pixels. The metric that we use for identifying the nearest neighbor, and hence for classifying an instance, is …

… On each round of boosting, a weak hypothesis is generated by adding one prototype at a time to the set until the set reaches a prespecified size. Given any set of prototypes, we always choose the mapping which minimizes the pseudo-loss of the resulting weak hypothesis (with respect to the given mislabel distribution). Ten candidate prototypes are selected at random according to the current (marginal) distribution over the training examples. Of these candidates, the one that causes the largest decrease in the pseudo-loss is added to the set, and the process is repeated. The boosting process thus influences the weak learning algorithm in two ways: first, by changing the way the ten random examples are selected, and second by changing the calculation of the pseudo-loss.

It often happens that, on the following round of boosting, the same set will have pseudo-loss significantly less than 1/2 with respect to the new mislabel distribution (but possibly using a different mapping). In this case, rather than building a new set, we reuse the same set of prototypes until the advantage that can be gained from the given partition is exhausted (details omitted).
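The greedy prototype-selection step described above can be sketched as follows. The representation of a weak hypothesis as a prototype set with hard 0/1 plausibilities (1 for the label of the nearest prototype, 0 otherwise), as well as every function name and parameter below, is our simplification for illustration; the paper's mapping from a prototype set to label plausibilities is more refined.

```python
import numpy as np

def nn_plausibility(x, label, prototypes, proto_labels):
    # hard 0/1 "plausibility": 1 if the nearest prototype carries this label, else 0
    d = np.linalg.norm(prototypes - x, axis=1)
    return 1.0 if proto_labels[int(np.argmin(d))] == label else 0.0

def pseudo_loss(D, X, y, prototypes, proto_labels):
    # eps = 1/2 * sum_{(i,y')} D(i,y') * (1 - h(x_i, y_i) + h(x_i, y'))
    return 0.5 * sum(w * (1.0 - nn_plausibility(X[i], y[i], prototypes, proto_labels)
                          + nn_plausibility(X[i], wrong, prototypes, proto_labels))
                     for (i, wrong), w in D.items())

def grow_prototype_set(X, y, D, size, n_candidates=10, seed=0):
    """Greedily grow a prototype set of the given size.

    Each step draws n_candidates training examples according to the marginal of the
    mislabel distribution D over examples and keeps the one whose addition yields the
    smallest pseudo-loss -- a simplified sketch, not the paper's exact weak learner.
    """
    rng = np.random.default_rng(seed)
    marginal = np.zeros(len(y))
    for (i, _), w in D.items():
        marginal[i] += w
    marginal /= marginal.sum()

    protos, proto_labels = [], []
    while len(protos) < size:
        candidates = rng.choice(len(y), size=n_candidates, p=marginal)
        best, best_loss = None, float("inf")
        for c in candidates:
            trial_p = np.array(protos + [X[c]])
            trial_l = proto_labels + [y[c]]
            loss = pseudo_loss(D, X, y, trial_p, trial_l)
            if loss < best_loss:
                best, best_loss = c, loss
        protos.append(X[best])
        proto_labels.append(y[best])
    return np.array(protos), proto_labels
```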
Figure 6: A sample of the examples that are given large weight after 3 of the 30 boosting iterations. The first line is after iteration 4, the second after iteration 12 and the third after iteration 25. Underneath each image we have a line of the form $d{:}\,\ell_1/w_1, \ell_2/w_2$, where $d$ is the label of the example, $\ell_1$ and $\ell_2$ are the labels that get the highest and second highest vote from the combined hypothesis at that point in the run of the algorithm, and $w_1$, $w_2$ are the corresponding normalized votes. (Typical label lines read, e.g., 4:1/0.23,4/0.22 and 9:9/0.15,4/0.15.)

Figure 7: A typical run of the boosting algorithm. The horizontal axis indicates the total number of prototypes that were added to the combined hypothesis, and the vertical axis indicates error. The topmost jagged line indicates the error of the weak hypothesis that is trained at this point on the weighted training set. The bold curve is the bound on the training error calculated using Theorem 2. The lowest thin curve and the medium-bold curve show the performance of the combined hypothesis on the training set and test set, respectively.

We ran 30 iterations of the boosting algorithm, and the numbers of prototypes we used were 10 for the first weak hypothesis, 20 for the second, 40 for the third, 80 for the next five, and 100 for the remaining twenty-two weak hypotheses. These sizes were chosen so that the errors of all of the weak hypotheses are approximately equal.

We compared the performance of our algorithm to a strawman algorithm which uses a single set of prototypes. Similar to our algorithm, the prototype set is generated incrementally, comparing ten prototype candidates at each step, and always choosing the one that minimizes the empirical error. We compared the performance of the boosting algorithm to that of the strawman hypothesis that uses the same number of prototypes. We also compared our performance to that of the condensed nearest neighbor rule (CNN) [13], a greedy method for finding a small set of prototypes which correctly classify the entire training set.

4.1 RESULTS AND DISCUSSION

The results of our experiments are summarized in Table 3 and Figure 7. Table 3 describes the results from experiments with AdaBoost (each experiment was repeated 10 times using different random seeds), the strawman algorithm (each repeated 7 times), and CNN (7 times). We compare the results using a random partition of the data into training and testing and using the partition that was defined by USPS.

We see that in both cases, after more than 970 examples, the training error of AdaBoost is much better than that of the strawman algorithm. The performance on the test set is similar, with a slight advantage to AdaBoost when the hypotheses include more than 1670 examples, but a slight advantage to strawman if fewer rounds of boosting are used. After 2670 examples, the error of AdaBoost on the random partition is (on average) 2.7%, while the error achieved by using the whole training set is 2.3%. On the USPS partition, the final error is 6.4%, while the error using the whole training set is 5.7%.

Comparing to CNN, we see that both the strawman algorithm and AdaBoost perform better than CNN even when they use about 900 examples in their hypotheses. Larger hypotheses generated by AdaBoost or strawman are much better than that generated by CNN. The main problem with CNN seems to be its tendency to overfit the training data. AdaBoost and the strawman algorithm seem to suffer less from overfitting.

Figure 7 shows a typical run of AdaBoost. The uppermost jagged line is a concatenation of the errors of the weak hypotheses with respect to the mislabel distribution. Each peak followed by a valley corresponds to the beginning and end errors of a weak hypothesis as it is being constructed, one prototype at a time. The weighted error always starts around 50% at the beginning of a boosting iteration and drops to around 20–30%. The heaviest line describes the upper bound on the training error that is guaranteed by Theorem 2, and the two bottom lines describe the training and test error of the final combined hypothesis.

It is interesting that the performance of the boosting algorithm on the test set improved significantly after the error on the training set had already become zero. This is surprising because an "Occam's razor" argument would predict that increasing the complexity of the hypothesis after the error has been reduced to zero is likely to degrade the performance on the test set.

Figure 6 shows a sample of the examples that are given large weights by the boosting algorithm on a typical run. There seem to be two types of "hard" examples. First are examples which are very atypical or wrongly labeled (such as example 2 on the first line and examples 3 and 4 on the second line). The second type, which tends to dominate on later iterations, consists of examples that are very similar to each other but have different labels (such as examples 3 versus 4 on the third line). Although the algorithm at this point was correct on all training examples, it is clear from the votes it assigned to different labels for these example pairs that it was still trying to improve the discrimination between similar examples. This agrees with our intuition that the pseudo-loss is a mechanism that causes the boosting algorithm to concentrate on the hard to discriminate labels of hard examples.
random partition USPS partition
AdaBoost Strawman CNN AdaBoost Strawman CNN
rnd size theory train test train test test (size) theory train test train test test (size)
1 10 524.6 45.9 46.1 37.9 38.3 536.3 42.5 43.1 36.1 37.6
5 230 86.4 6.3 8.5 4.9 6.2 83.0 5.1 12.3 4.2 10.6
10 670 16.0 0.4 4.6 2.0 4.3 10.9 0.1 8.6 1.4 8.3
13 970 4.5 0.0 3.9 1.5 3.8 4.4 (990) 3.3 0.0 8.1 1.0 7.7 8.6 ( 865)
15 1170 2.4 0.0 3.6 1.3 3.6 1.5 0.0 7.7 0.8 7.5
20 1670 0.4 0.0 3.1 0.9 3.3 0.2 0.0 7.0 0.6 7.1
25 2170 0.1 0.0 2.9 0.7 3.0 0.0 0.0 6.7 0.4 6.9
30 2670 0.0 0.0 2.7 0.5 2.8 0.0 0.0 6.4 0.3 6.8
Table 3: Average error rates on training and test sets, in percent. For columns labeled “random partition,” a random partition of the union
of the training and test sets was used; “USPS partition” means the USPS-provided partition into training and test sets was used. Columns
labeled “theory” give theoretical upper bounds on training error calculated using Theorem 2. “Size” indicates number of prototypes
defining the final hypothesis.
5 CONCLUSIONS

We have demonstrated that AdaBoost can be used in many settings to improve the performance of a learning algorithm. When starting with relatively simple classifiers, the improvement can be especially dramatic, and can often lead to a composite classifier that outperforms more complex "one-shot" learning algorithms like C4.5. This improvement is far greater than can be achieved with bagging. Note, however, that for non-binary classification problems, boosting simple classifiers can only be done effectively if the more sophisticated pseudo-loss is used.

When starting with a complex algorithm like C4.5, boosting can also be used to improve performance, but does not have such a compelling advantage over bagging. Boosting combined with a complex algorithm may give the greatest improvement in performance when there is a reasonably large amount of data available (note, for instance, boosting's performance on the "letter-recognition" problem with 16,000 training examples). Naturally, one needs to consider whether the improvement in error is worth the additional computation time. Although we used 100 rounds of boosting, Quinlan [19] got good results using only 10 rounds.

Boosting may have other applications, besides reducing the error of a classifier. For instance, we saw in Section 4 that boosting can be used to find a small set of prototypes for a nearest neighbor classifier.

As described in the introduction, boosting combines two effects. It reduces the bias of the weak learner by forcing the weak learner to concentrate on different parts of the instance space, and it also reduces the variance of the weak learner by averaging several hypotheses that were generated from different subsamples of the training set. While there is good theory to explain the bias reducing effects, there is need for a better theory of the variance reduction.

Acknowledgements. Thanks to Jason Catlett and William Cohen for extensive advice on the design of our experiments. Thanks to Ross Quinlan for first suggesting a comparison of boosting and bagging. Thanks also to Leo Breiman, Corinna Cortes, Harris Drucker, Jeff Jackson, Michael Kearns, Ofer Matan, Partha Niyogi, Warren Smith, David Wolpert and the anonymous ICML reviewers for helpful comments, suggestions and criticisms. Finally, thanks to all who contributed to the datasets used in this paper.

References

[1] Leo Breiman. Bagging predictors. Technical Report 421, Department of Statistics, University of California at Berkeley, 1994.
[2] Leo Breiman. Bias, variance, and arcing classifiers. Unpublished manuscript, 1996.
[3] William Cohen. Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning, pages 115–123, 1995.
[4] Tom Dietterich, Michael Kearns, and Yishay Mansour. Applying the weak learning framework to understand and improve C4.5. In Machine Learning: Proceedings of the Thirteenth International Conference, 1996.
[5] Harris Drucker and Corinna Cortes. Boosting decision trees. In Advances in Neural Information Processing Systems 8, 1996.
[6] Harris Drucker, Corinna Cortes, L. D. Jackel, Yann LeCun, and Vladimir Vapnik. Boosting and other ensemble methods. Neural Computation, 6(6):1289–1301, 1994.
[7] Harris Drucker, Robert Schapire, and Patrice Simard. Boosting performance in neural networks. International Journal of Pattern Recognition and Artificial Intelligence, 7(4):705–719, 1993.
[8] Harris Drucker, Robert Schapire, and Patrice Simard. Improving performance in neural networks using a boosting algorithm. In Advances in Neural Information Processing Systems 5, pages 42–49, 1993.
[9] Yoav Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256–285, 1995.
[10] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Unpublished manuscript available electronically (on our web pages, or by email request). An extended abstract appeared in Computational Learning Theory: Second European Conference, EuroCOLT '95, pages 23–37, Springer-Verlag, 1995.
[11] Johannes Fürnkranz and Gerhard Widmer. Incremental reduced error pruning. In Machine Learning: Proceedings of the Eleventh International Conference, pages 70–77, 1994.
[12] Geoffrey W. Gates. The reduced nearest neighbor rule. IEEE Transactions on Information Theory, pages 431–433, 1972.
[13] Peter E. Hart. The condensed nearest neighbor rule. IEEE Transactions on Information Theory, IT-14:515–516, May 1968.
[14] Robert C. Holte. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11(1):63–91, 1993.
[15] Jeffrey C. Jackson and Mark W. Craven. Learning sparse perceptrons. In Advances in Neural Information Processing Systems 8, 1996.
[16] Michael Kearns and Yishay Mansour. On the boosting ability of top-down decision tree learning algorithms. In Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing, 1996.
[17] Eun Bae Kong and Thomas G. Dietterich. Error-correcting output coding corrects bias and variance. In Proceedings of the Twelfth International Conference on Machine Learning, pages 313–321, 1995.
[18] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[19] J. Ross Quinlan. Bagging, boosting, and C4.5. In Proceedings, Fourteenth National Conference on Artificial Intelligence, 1996.
[20] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.
[21] Patrice Simard, Yann Le Cun, and John Denker. Efficient pattern recognition using a new transformation distance. In Advances in Neural Information Processing Systems, volume 5, pages 50–58, 1993.