Lecture12-Ch8-ClassBasic-Part2
— Chapter 8 —
Akhil Chaudhary
Chapter 8. Classification: Basic Concepts
Bayes Classification: Why?
• Performance:
Bayes’ Theorem
• With the previous example, the prior probability P(H) is the probability that any given customer will buy a computer, regardless of age, income, or any other information.
• The posterior probability, P(H|X), is based on more information (e.g., customer information) than the prior probability, P(H), which is independent of X.
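For reference, the theorem itself relates these two quantities (standard statement, with H the hypothesis and X the evidence tuple):

\[ P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)} \]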
Naive Bayesian Classification
• Therefore, we need to maximize P(Ci|X). The class Ci that leads to the maximum P(Ci|X) is called the maximum posteriori hypothesis. With Bayes’ theorem (Eq. 8.10), we have:
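Stated here for completeness, the equation in question is Bayes’ theorem applied to each class Ci:

\[ P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)} \]

Since P(X) is the same for every class, maximizing P(Ci|X) comes down to maximizing P(X|Ci)P(Ci).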
Naive Bayesian Classification
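As context for case (b) below, recall the naive assumption of class-conditional independence, which later slides cite as Eq. (8.12); it is stated here for completeness, following the standard textbook treatment:

\[ P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) \]

For case (a), a categorical attribute Ak, P(xk|Ci) is estimated as the number of class-Ci training tuples having value xk for Ak, divided by the number of class-Ci tuples.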
b) If Ak is continuous-valued, then we need to do a bit more work, but the calculation is pretty straightforward. A continuous-valued attribute is typically assumed to have a Gaussian distribution with a mean 𝜇 and standard deviation 𝜎, defined by (Eq. 8.13):
g(x, 𝜇, 𝜎) = (1 / (√(2π) 𝜎)) · exp(−(x − 𝜇)² / (2𝜎²))
So that:
P(xk | Ci) = g(xk, 𝜇Ci, 𝜎Ci)
We simply need to compute 𝜇Ci and 𝜎Ci, which are the mean (i.e., average) and standard deviation, respectively, of the values of attribute Ak for training tuples of class Ci. We then plug these two quantities into Eq. (8.13), together with xk, to estimate P(xk|Ci).
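As a concrete illustration of this step, here is a minimal Python sketch (the function names and toy values are mine, not from the slides) that estimates 𝜇Ci and 𝜎Ci from the class's training values and evaluates the Gaussian density of Eq. (8.13):

```python
import math

def gaussian(x, mu, sigma):
    """Gaussian density g(x, mu, sigma) of Eq. (8.13)."""
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * \
           math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

def estimate_continuous_likelihood(x_k, class_values):
    """Estimate P(x_k | Ci) for a continuous attribute: compute the mean
    and standard deviation of the attribute's values over the training
    tuples of class Ci, then plug them into the Gaussian together with x_k."""
    n = len(class_values)
    mu = sum(class_values) / n
    variance = sum((v - mu) ** 2 for v in class_values) / n
    sigma = math.sqrt(variance)
    return gaussian(x_k, mu, sigma)

# Toy usage with made-up ages of customers in class Ci:
ages_in_class = [25, 31, 38, 42, 47, 52]
print(estimate_continuous_likelihood(35, ages_in_class))
```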
Naive Bayesian Classification
• For example:
• Let X = (35, $40,000), where A1 and A2 are the attributes age and income, respectively.
• Let the class label attribute be buys_computer.
• The associated class label for X is yes (i.e., buys_computer = yes).
• Let’s suppose that age has not been discretized and therefore exists as a continuous-valued attribute.
• Suppose that from the training set, we find that customers in D who buy a computer are 38 ± 12 years of age. In other words, for attribute age and this class, we have 𝜇 = 38 years and 𝜎 = 12. We can plug these quantities, along with x1 = 35 for our tuple X, into Eq. (8.13) to estimate P(age = 35 | buys_computer = yes).
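Carrying out that arithmetic with the values from the slide (𝜇 = 38, 𝜎 = 12, x1 = 35), a quick check in Python:

```python
import math

mu, sigma, x1 = 38.0, 12.0, 35.0
g = (1.0 / (math.sqrt(2 * math.pi) * sigma)) * \
    math.exp(-((x1 - mu) ** 2) / (2 * sigma ** 2))
print(round(g, 4))  # about 0.0322, the estimate of P(age = 35 | buys_computer = yes)
```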
Naive Bayesian Classification
• In other words, the predicted class label for X is the class Ci for which P(X|Ci)P(Ci) is the maximum.
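Putting the pieces together, here is a minimal sketch of that decision rule (the data structures and helper names are assumptions for illustration, not the slides' notation): for each class, multiply the prior P(Ci) by the per-attribute likelihoods P(xk|Ci) from Eq. (8.12), and predict the class with the largest product.

```python
def predict(x, priors, likelihoods):
    """Naive Bayes decision rule: return the class Ci that maximizes
    P(X|Ci)P(Ci), where P(X|Ci) is the product of per-attribute
    likelihoods under the class-conditional independence assumption."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for k, x_k in enumerate(x):
            score *= likelihoods[c][k](x_k)   # P(x_k | Ci)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy usage: two attributes (age, income), likelihoods given as callables.
priors = {"yes": 0.6, "no": 0.4}                   # made-up priors
likelihoods = {
    "yes": [lambda age: 0.032, lambda inc: 0.1],   # made-up P(x_k | yes)
    "no":  [lambda age: 0.020, lambda inc: 0.05],  # made-up P(x_k | no)
}
print(predict((35, 40000), priors, likelihoods))   # -> "yes"
```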
Naive Bayesian Classification: Example
• We need to maximize P(X|Ci)P(Ci), for i = 1, 2.
• P(Ci), the prior probability of each class, can be computed using the training tuples:
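A minimal sketch of that computation (the labels below are made up; in the textbook example the classes would be buys_computer = yes and buys_computer = no): count how many training tuples fall in each class and divide by the total.

```python
from collections import Counter

labels = ["yes", "yes", "no", "yes", "no", "yes"]   # hypothetical class labels from D
counts = Counter(labels)
priors = {c: n / len(labels) for c, n in counts.items()}
print(priors)   # e.g. {'yes': 0.6666..., 'no': 0.3333...}
```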
Naive Bayesian Classification: Example
• Using these probabilities, we obtain:
• Similarly, we have:
• If any single P(xk|Ci) is zero, plugging that zero value into Eq. (8.12) would return a zero probability for the whole P(X|Ci).
Naive Bayesian Classification: A Trick
• We can assume that our training data set, D, is so large that adding one to each count that we need would only make a negligible difference to the estimated probability value, yet would conveniently avoid the case of probability values of zero.
• This technique for probability estimation is known as the Laplacian correction or Laplace estimator.
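A minimal sketch of the correction (the counts below are made up for illustration): add one to each value's count and add the number of distinct values of the attribute to the denominator, so that no estimate is exactly zero while large counts are barely changed.

```python
def laplace_corrected_probs(value_counts, total):
    """Laplacian correction: add 1 to every count and add the number
    of distinct attribute values to the total, avoiding zero estimates."""
    num_values = len(value_counts)
    return {v: (c + 1) / (total + num_values) for v, c in value_counts.items()}

# Hypothetical counts for attribute income within one class Ci:
income_counts = {"low": 0, "medium": 990, "high": 10}
print(laplace_corrected_probs(income_counts, total=1000))
# -> roughly {'low': 0.001, 'medium': 0.988, 'high': 0.011}
```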