Lecture12-Ch8-ClassBasic-Part2

Data Mining: Concepts and Techniques
— Chapter 8 —
Akhil Chaudhary

1
Chapter 8. Classification: Basic Concepts

• Classification: Basic Concepts
• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Model Evaluation and Selection
• Summary

2
Bayes Classification: Why?

• A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities. It is based on Bayes’ theorem (described on the next slide).
• Performance: a simple Bayesian classifier, the naive Bayesian classifier, has performance comparable to decision tree and selected neural network classifiers.
• Bayesian classifiers can be highly accurate and fast when applied to large data sets.

3
Bayes Classification: Why?

• Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct, so prior knowledge can be combined with observed data.
• Class-conditional independence (the key assumption of the naive Bayesian classifier):
  • The naive Bayesian classifier assumes that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class-conditional independence.

4
Bayes’ Theorem

• Let X be a specific data tuple. In Bayesian terms, X is considered “evidence.”
• Let H be some hypothesis that the data tuple X belongs to a specified class C.
• For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the “evidence,” i.e., the observed data tuple X.
• In other words, we are looking for the probability that tuple X belongs to class C, given that we know the attribute description of X.

5
Bayes’ Theorem

• In Bayesian statistics, P(H|X) is the posterior probability of H conditioned on X.
• For example, suppose that:
  • the data set is about customers described by the attributes age and income;
  • X is a 35-year-old customer with an income of $40,000;
  • H is the hypothesis that the customer will buy a computer.
• Then P(H|X) reflects the probability that customer X will buy a computer given that we know the customer’s age and income.

6
Bayes’ Theorem

• In Bayesian statistics, P(H) is the prior probability of H.
• In the previous example, this is the probability that any given customer will buy a computer, regardless of age, income, or any other information.
• The posterior probability P(H|X) is based on more information (e.g., customer information) than the prior probability P(H), which is independent of X.
• Similarly, P(X|H) is the posterior probability of X conditioned on H; that is, the probability that a customer X is 35 years old and earns $40,000, given that we know the customer will buy a computer.

7
Bayes’ Theorem

• P(X) is the prior probability of X. In our example, it is the probability that a person from our set of customers is 35 years old and earns $40,000.
• “How are these probabilities estimated?” P(H), P(X|H), and P(X) may be estimated from the given data set, as we shall see in the next example.
• Bayes’ theorem is useful in that it provides a way of calculating the posterior probability P(H|X) from P(H), P(X|H), and P(X).
• Formally, Bayes’ theorem states (Eq. 8.10; a small numeric illustration follows below):

  P(H|X) = P(X|H) P(H) / P(X)

8
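
• To make the theorem concrete, here is a small worked example with hypothetical numbers (chosen for illustration, not taken from the lecture’s data set). Suppose P(H) = 0.5 (half of all customers buy a computer), P(X|H) = 0.2, and P(X) = 0.16. Then

  P(H|X) = P(X|H) P(H) / P(X) = (0.2 × 0.5) / 0.16 = 0.625,

  so observing X’s age and income raises the estimated probability of a purchase from the prior 0.5 to 0.625.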
Naive Bayesian Classification

The naive Bayesian classifier, or simple Bayesian classifier, works in the following manner:
• 1) Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector X = (x1, x2, …, xn), whose elements correspond to the attributes A1, A2, …, An, respectively.
• 2) Suppose that there are m classes, C1, C2, …, Cm. Given a tuple X, the naive Bayesian classifier concludes that X belongs to the class with the highest posterior probability conditioned on X.
• That is, the naive Bayesian classifier concludes that tuple X belongs to class Ci if and only if:

  P(Ci|X) > P(Cj|X)   for 1 ≤ j ≤ m, j ≠ i

9
Naive Bayesian Classification

• Therefore, we need to maximize P(Ci|X). The class Ci that maximizes P(Ci|X) is called the maximum posteriori hypothesis. By Bayes’ theorem (Eq. 8.10), we have:

  P(Ci|X) = P(X|Ci) P(Ci) / P(X)

• 3) As P(X) is constant for all classes, to maximize P(Ci|X) we only need to maximize P(X|Ci) P(Ci).
• If the class prior probabilities P(Ci) are unknown, it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = … = P(Cm), and then we only need to maximize P(X|Ci).
• Otherwise, we maximize P(X|Ci) P(Ci). Note that the class prior probabilities can be estimated using P(Ci) = |Ci,D| / |D|, where |Ci,D| is the number of training tuples of class Ci in D (a code sketch of this estimate follows below).

10
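
A minimal sketch of this prior estimate in Python, assuming the class labels are available as a plain list (the names estimate_priors and priors are illustrative, not from the lecture):

  from collections import Counter

  def estimate_priors(labels):
      """Estimate P(Ci) = |Ci,D| / |D| from a list of class labels."""
      counts = Counter(labels)
      total = len(labels)
      return {cls: n / total for cls, n in counts.items()}

  # Example: 9 "yes" and 5 "no" labels give P(yes) = 9/14 and P(no) = 5/14.
  priors = estimate_priors(["yes"] * 9 + ["no"] * 5)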
Naive Bayesian Classification

• 4) Given a data set with many attributes, it would be extremely computationally expensive to compute P(X|Ci) directly.
• To reduce the computation in evaluating P(X|Ci), the assumption of class-conditional independence is made.
• This presumes that the attribute values are conditionally independent of one another, given the class label of the tuple (i.e., that there are no dependence relationships among the attributes).
• Therefore, we have (Eq. 8.12):

  P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)

11
Naive Bayesian Classification

• We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), …, P(xn|Ci) from the training tuples.
• Recall that xk refers to the value of attribute Ak for tuple X. For each attribute, we need to know whether the attribute is categorical or continuous-valued.
• a) If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D that have the value xk for Ak, divided by |Ci,D|, the number of training tuples belonging to class Ci (a code sketch of this count ratio follows below).

12
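
A minimal sketch of this count ratio, assuming each training tuple is stored as a plain Python tuple of attribute values with a parallel list of class labels (all names are illustrative):

  def categorical_likelihood(rows, labels, attr_index, value, cls):
      """Estimate P(Ak = value | Ci = cls) as a ratio of counts."""
      in_class = [row for row, lab in zip(rows, labels) if lab == cls]
      matching = sum(1 for row in in_class if row[attr_index] == value)
      return matching / len(in_class)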
Naive Bayesian Classification

• b) If Ak is continuous-valued, then we need to do a bit more work, but the calculation is pretty straightforward. A continuous-valued attribute is typically assumed to have a Gaussian distribution with mean μ and standard deviation σ, defined by (Eq. 8.13):

  g(x, μ, σ) = (1 / (√(2π) σ)) · exp( −(x − μ)² / (2σ²) )

  so that:

  P(xk|Ci) = g(xk, μCi, σCi)

• We simply need to compute μCi and σCi, which are the mean (i.e., average) and standard deviation, respectively, of the values of attribute Ak for the training tuples of class Ci. We then plug these two quantities into Eq. (8.13), together with xk, to estimate P(xk|Ci) (see the code sketch below).

13
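
A minimal sketch of Eq. (8.13) in Python (the function name is illustrative):

  import math

  def gaussian_likelihood(x, mu, sigma):
      """Gaussian density g(x, mu, sigma), used to estimate P(xk | Ci)
      for a continuous-valued attribute Ak."""
      coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
      exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
      return coeff * math.exp(exponent)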
Naive Bayesian Classification

• For example:
  • Let X = (35, $40,000), where A1 and A2 are the attributes age and income, respectively.
  • Let the class label attribute be buys_computer.
  • The associated class label for X is yes (i.e., buys_computer = yes).
  • Suppose that age has not been discretized and therefore exists as a continuous-valued attribute.
  • Suppose that from the training set we find that customers in D who buy a computer are 38 ± 12 years of age. In other words, for attribute age and this class, we have μ = 38 years and σ = 12. We can plug these quantities, along with x1 = 35 for our tuple X, into Eq. (8.13) to estimate P(age = 35 | buys_computer = yes) — the arithmetic is carried out below.

14
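
Carrying this calculation through (the value below is computed here from Eq. (8.13), not quoted from the lecture):

  P(age = 35 | buys_computer = yes) ≈ g(35, 38, 12)
    = (1 / (√(2π) · 12)) · exp( −(35 − 38)² / (2 · 12²) )
    ≈ 0.0332 × 0.969 ≈ 0.032

In code, this is simply gaussian_likelihood(35, 38, 12) using the sketch above.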
Naive Bayesian Classification

• 5) To predict the class label of X, P(X|Ci) P(Ci) is evaluated for each class Ci. The classifier predicts that the class label of tuple X is the class Ci if and only if:

  P(X|Ci) P(Ci) > P(X|Cj) P(Cj)   for 1 ≤ j ≤ m, j ≠ i

• In other words, the predicted class label for X is the class Ci for which P(X|Ci) P(Ci) is the maximum (a compact code sketch combining these steps follows below).

15
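
Putting steps 1)–5) together, here is a compact sketch of training and prediction for categorical attributes (all names are illustrative; it simply combines the counting ideas above):

  from collections import Counter, defaultdict

  def train_naive_bayes(rows, labels):
      """Collect the counts needed for P(Ci) and for each P(xk | Ci)."""
      class_counts = Counter(labels)           # |Ci,D| for each class
      value_counts = defaultdict(Counter)      # (class, k) -> counts of values of Ak
      for row, cls in zip(rows, labels):
          for k, value in enumerate(row):
              value_counts[(cls, k)][value] += 1
      return class_counts, value_counts, len(labels)

  def predict(x, class_counts, value_counts, total):
      """Return the class Ci that maximizes P(X|Ci) * P(Ci)."""
      best_cls, best_score = None, -1.0
      for cls, n_cls in class_counts.items():
          score = n_cls / total                                  # P(Ci)
          for k, value in enumerate(x):
              score *= value_counts[(cls, k)][value] / n_cls     # P(xk | Ci)
          if score > best_score:
              best_cls, best_score = cls, score
      return best_cls

Note that an attribute value never seen with some class makes the corresponding factor zero; this is exactly the problem the Laplacian correction on the later slides addresses.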
Naive Bayesian Classification: Example

• To illustrate how naive Bayesian classification works, we use the same data set used for decision tree induction (i.e., Table 8.1).
• Specifically:
  • The data tuples are described by the attributes age, income, student, and credit_rating.
  • The class label attribute, buys_computer, has two distinct values (namely, {yes, no}).
  • Let C1 correspond to the class buys_computer = yes and C2 correspond to buys_computer = no.
• The tuple we wish to classify is:

  X = (age = youth, income = medium, student = yes, credit_rating = fair)

16
Naive Bayesian Classification: Example

• We need to maximize P(X|Ci) P(Ci) for i = 1, 2.
• P(Ci), the prior probability of each class, can be computed using the training tuples:

  P(buys_computer = yes) = 9/14 = 0.643
  P(buys_computer = no) = 5/14 = 0.357

• To compute P(X|Ci) for i = 1, 2, we compute the following conditional probabilities:

  P(age = youth | buys_computer = yes) = 2/9 = 0.222
  P(age = youth | buys_computer = no) = 3/5 = 0.600
  P(income = medium | buys_computer = yes) = 4/9 = 0.444
  P(income = medium | buys_computer = no) = 2/5 = 0.400
  P(student = yes | buys_computer = yes) = 6/9 = 0.667
  P(student = yes | buys_computer = no) = 1/5 = 0.200
  P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
  P(credit_rating = fair | buys_computer = no) = 2/5 = 0.400

17
Naive Bayesian Classification: Example

• Using these probabilities, we obtain:

  P(X | buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044

• Similarly, we have:

  P(X | buys_computer = no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019

• To find the class Ci that maximizes P(X|Ci) P(Ci), we compute:

  P(X | buys_computer = yes) P(buys_computer = yes) = 0.044 × 0.643 = 0.028
  P(X | buys_computer = no) P(buys_computer = no) = 0.019 × 0.357 = 0.007

• Therefore, the naive Bayesian classifier predicts buys_computer = yes for tuple X (a quick check of this arithmetic follows below).

18
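
A quick check of this arithmetic, using only the rounded conditional probabilities and priors listed on the previous slides:

  p_x_yes = 0.222 * 0.444 * 0.667 * 0.667   # ~0.044
  p_x_no  = 0.600 * 0.400 * 0.200 * 0.400   # ~0.019
  print(p_x_yes * 0.643, p_x_no * 0.357)    # ~0.028 vs ~0.007, so predict "yes"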
Naive Bayesian Classification: A Trick

• Recall that Eq. (8.12) says:

  P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)

• What if we encounter probability values of zero?
• Using the previous example: what if there are no training tuples representing students for the class buys_computer = no, which results in P(student = yes | buys_computer = no) = 0?
• Plugging this zero value into Eq. (8.12) would return a zero probability for P(X|Ci), cancelling the effect of all the other conditional probabilities in the product.

19
Naive Bayesian Classification: A Trick

• There is a simple trick to avoid this problem.
• We can assume that our training data set, D, is so large that adding one to each count that we need would make only a negligible difference to the estimated probability values, yet would conveniently avoid the case of probability values of zero.
• This technique for probability estimation is known as the Laplacian correction or Laplace estimator.
• Note that if we modify the counts, we must revise the denominator used in the probability calculation accordingly (a code sketch follows below).

20
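
A minimal sketch of the corrected estimate (the function name is illustrative; num_values is the number of distinct values of the attribute, e.g. 3 for income in {low, medium, high}):

  def laplace_likelihood(value_count, class_count, num_values):
      """Laplace-corrected estimate of P(Ak = value | Ci): add 1 to the count
      for this value and num_values to the denominator, so that every value
      of Ak receives a nonzero probability."""
      return (value_count + 1) / (class_count + num_values)

  # The income example on the next slide: counts 0, 990 and 10 out of 1000 tuples.
  print(laplace_likelihood(0, 1000, 3))     # 1/1003   ~ 0.001
  print(laplace_likelihood(990, 1000, 3))   # 991/1003 ~ 0.988
  print(laplace_likelihood(10, 1000, 3))    # 11/1003  ~ 0.011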
Naive Bayesian Classification: A Trick

• Suppose that for the class buys_computer = yes in some training database, D, containing 1,000 tuples, we have:
  • 0 tuples with income = low,
  • 990 tuples with income = medium,
  • 10 tuples with income = high.
• The probabilities of these events, without the Laplacian correction, are 0, 0.990 (from 990/1000), and 0.010 (from 10/1000), respectively.
• Using the Laplacian correction for the three quantities, we pretend that we have 1 more tuple for each income-value pair.

21
Naive Bayesian Classification: A Trick

• In this way, we instead obtain the following probabilities (rounded to three decimal places):

  P(income = low | buys_computer = yes) = 1/1003 ≈ 0.001
  P(income = medium | buys_computer = yes) = 991/1003 ≈ 0.988
  P(income = high | buys_computer = yes) = 11/1003 ≈ 0.011

• The “corrected” probability estimates are close to the “uncorrected” original probabilities, yet the zero probability value is avoided.

22
