Chapter 5: Machine Learning
Machine Learning
What do you know about machine learning?
AI vs ML vs DL
What is machine learning?
• It means that ML is able to perform a specified task without being directly told how to do it.
• Examples:
• Distinguishing between spam and valid email messages: given a set of manually labeled good and bad email examples, an algorithm can automatically learn a set of rules that distinguish them.
• Language identification (Amharic, Ge'ez, Tigrigna, Afar, etc.) (How?)
• Arthur Samuel (1959) defined machine learning as “a sub-field of computer science that gives computers the ability to learn without being explicitly programmed.”
What is machine learning?
• A widely accepted formal definition by Tom Mitchell (1997, professor at Carnegie Mellon University):
• A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at the tasks in T, as measured by P, improves with experience E.
• In short
• A set of computer programs that automatically learn from past experiences (examples or a training corpus).
• Example: According to this definition, we can reformulate the email problem as the task of identifying spam messages (task T) using the data of previously labeled email messages (experience E) through a machine learning algorithm, with the goal of improving the future email spam labeling (performance measure P).
What is machine learning?
⚫ This learning process is often carried out through repeated exposure to the defined problem (training dataset), allowing the model to self-optimize and continuously improve its ability to solve new, previously unseen problems (test dataset).
Applications of Machine Learning
[Figure: traditional programming vs. machine learning]
Traditional programming: Data + Program → Computation → Results
Machine learning: Data + Results → Computation → Program
Classes of Machine Learning Problems
• Supervised Learning
• Unsupervised Learning
• Semi-supervised Learning
• Reinforcement Learning
Classes of Machine Learning Problems
Supervised Learning
• Learn to predict output when given an input vector
• Training data includes desired outputs
Machine learning structure
Classes of Machine Learning Problems
Unsupervised Learning
• The aim is to uncover the underlying structures (classes or clusters) in the data.
• Training data does not include desired outputs. This is the new frontier of machine learning because most big datasets do not come with labels.
Clustering
Dimensionality reduction
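As a small illustration of the unsupervised setting, the sketch below, assuming scikit-learn is available (the data points are made up), clusters unlabeled instances into two groups:

```python
# A minimal clustering sketch, assuming scikit-learn; the feature
# vectors below are invented for illustration and carry no labels.
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled instances: each row is a feature vector, no class label given.
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment discovered for each instance
print(kmeans.cluster_centers_)  # the uncovered structure: two cluster centers
```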
Machine learning structure
• Semi-supervised Learning
• Desired outputs or classes are available for only a part of the training data.
• This approach is useful when it is impractical or too expensive to access or measure the target variable for all participants.
Machine learning structure
Reinforcement Learning
• A learning method that interacts with its environment by producing actions and discovering errors or rewards.
• On the basis of trial and error, it discovers which actions maximize reward and minimize penalty.
The Learning Problem
• Given <x, f(x)> pairs, infer f
y = f(x)
• Model construction:
• A training set is used to create the model.
• The model is represented as classification rules, decision trees, or mathematical formulae.
• Model usage:
• The test set is used to see how well the model works for classifying future or unknown objects.
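As a minimal sketch of construction vs. usage, assuming scikit-learn (the dataset and classifier choice are illustrative, not prescribed by these slides):

```python
# A minimal sketch of model construction vs. model usage, assuming
# scikit-learn; the dataset and classifier choice are illustrative only.
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Model construction: learn the classifier from the training set.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Model usage: check how well it classifies held-out (unseen) objects.
print(accuracy_score(y_test, model.predict(X_test)))
```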
Step 1: Model Construction
[Figure: training data is fed to classification algorithms to build a classifier (model); the model is then applied to testing data and unseen data.]
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Unseen data: (Jeff, Professor, 4) → Tenured?
Challenges in Machine Learning
• Instances/feature vectors
• Represented by rows in the feature matrix.
• Each instance has a class label.
• Class labels
• Indicate the category of each instance.
• This example has two classes (C1 and C2).
• Only used for supervised learning.
Class label  Instances (feature vectors)
C2           63   1.5  4   0  3.5
C1           109  0.4  6   1  2.4
C1           34   0.2  1   0  3.0
C1           33   0.9  6   1  5.3
C2           565  4.3  10  0  3.2
C1           21   4.3  1   0  1.2
C2           35   5.6  2   0  9.1
Formally:
Given a set of data points X = {x1, x2, ..., xn} and a finite set of target classes Y = {y1, y2, ..., ym}, the classification problem is to define a mapping f : X → Y, where each xi is assigned to one of the classes yi.
Classification
Typical applications
Credit/loan approval: e.g., a bank officer needs to analyze data to learn which loan applications are “safe” and which are “risky”
Medical diagnosis: if a tumor is cancerous or benign
Fraud detection: if a transaction is fraudulent
Web page categorization: which category a page belongs to
Numeric Prediction
models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications
Predicting house prices based on size, location, etc.
Forecasting sales revenue for a company
Estimating a student’s exam score based on study hours
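A minimal numeric-prediction sketch, assuming scikit-learn; the house sizes and prices below are invented for illustration:

```python
# A minimal numeric-prediction sketch, assuming scikit-learn; the
# sizes and prices are made up to illustrate a continuous-valued target.
import numpy as np
from sklearn.linear_model import LinearRegression

sizes = np.array([[50], [80], [120], [200]])  # house size in square meters
prices = np.array([100, 160, 240, 400])       # price in thousands

reg = LinearRegression().fit(sizes, prices)
print(reg.predict([[150]]))                   # predicted price for a 150 m^2 house
```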
Classification—A Two-Step Process
[Figure: the training phase builds the classification model; the testing phase applies it.]
Common Classification Methods
K-Nearest-Neighbor
Decision Tree
Naïve Bayesian
Support Vector Machine (SVM)
Artificial Neural Network (ANN)
k-Nearest Neighbor (k-NN) Classification
In k-nearest-neighbor (k-NN) classification, the training dataset is used to classify each member of a target dataset.
When there is a large range between or among attributes (e.g. between income = 42,000 and age = 35), normalization is performed to prevent large-valued attributes from outweighing small-valued ones, e.g. min-max normalization: v' = (v − min_A) / (max_A − min_A).
[Figure: for a test record, compute the distance to all training samples, sort the distances, determine the nearest neighbors based on the k-th minimum distance, and predict by simple majority of their categories.]
k-NN Example
Training data:
Acid Durability (X1)  Strength (X2)  Classification (Y)
7                     7              Bad
7                     4              Bad
3                     4              Good
1                     4              Good
k-nearest neighbors algorithm steps:
Step 1. Determine parameter k = the number of nearest neighbors.
Step 2. Calculate the distance between the query instance and all the training samples.
Step 3. Sort the distances and determine the nearest neighbors based on the k-th minimum distance.
Step 4. Gather the categories Y of the nearest neighbors.
Step 5. Use the simple majority of the categories of the nearest neighbors as the prediction value of the query instance.
Question: A factory produces a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7. Classify this paper as good or bad using the k-nearest neighbor method.
k-NN Example
Step 2. Calculate the (squared) distance between the query instance (3, 7) and all the training examples:
(7, 7): (7-3)² + (7-7)² = 16
(7, 4): (7-3)² + (4-7)² = 25
(3, 4): (3-3)² + (4-7)² = 9
(1, 4): (1-3)² + (4-7)² = 13
Source: https://medium.com/@arman_hussain786/k-nearest-neighbors-knn-and-its-applications
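A short sketch of the full procedure on this example, using only the Python standard library; k = 3 is our assumption for this exercise:

```python
# A minimal k-NN sketch for the paper-tissue example above; k = 3 is assumed.
from collections import Counter
import math

train = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query = (3, 7)
k = 3

# Step 2: Euclidean distance from the query to every training sample.
dists = [(math.dist(x, query), label) for x, label in train]
# Step 3: sort and keep the k nearest neighbors.
dists.sort(key=lambda d: d[0])
neighbours = dists[:k]
# Steps 4-5: simple majority vote over the neighbors' categories.
prediction = Counter(label for _, label in neighbours).most_common(1)[0][0]
print(neighbours, "->", prediction)  # nearest: (3,4) Good, (1,4) Good, (7,7) Bad -> "Good"
```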
Classification – Decision Tree
A Decision Tree is a flowchart-like tree structure, where:
each internal node (non-leaf node) denotes a test on an attribute,
each branch represents an outcome of the test,
and each leaf node (or terminal node) holds a class label.
[Figure: a decision tree with the root node at the top and leaf nodes at the bottom.]
Classification – Decision Tree
How is classification performed using a DT?
Given a tuple, X, for which the associated class label is unknown, the attribute values of the tuple are tested against the decision tree. A path is traced from the root to a leaf node, which holds the class prediction for that tuple.
Decision trees can easily be converted to classification rules.
[Figure: example splits on a discrete-valued attribute, a continuous-valued attribute, and a discrete-valued attribute yielding a binary tree.]
Classification – Decision Tree – attribute selection measures
Also known as splitting rules, since they determine how the instances at a certain node are to be split.
Given the training instances, the attribute selection measure gives a ranking for each attribute.
The attribute with the highest score is chosen as the splitting attribute.
The splitting criterion is used to label the tree node created for a partition of the dataset.
Branches grow according to the outcome of the splitting criterion, and instances are partitioned accordingly.
Classification – Decision Tree – attribute selection measures
1) Information gain:
It is based on the theory of information proposed by Claude Shannon.
ID3 uses it as its attribute selection measure.
Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N.
This attribute minimizes the information needed to classify the instances.
Entropy
• D – example set, C1, ..., CN – classes
• Entropy E(D) – a measure of the impurity of the training set:
Info(D) = E(D) = − Σ_{i=1}^{N} p_i log2(p_i), where p_i is the proportion of instances in D belonging to class Ci.
Info(D) is the average amount of information needed to identify the class label of a tuple in D.
• Entropy in binary classification problems:
E(D) = − p+ log2 p+ − p− log2 p−
Entropy
• E(D) = − p+ log2 p+ − p− log2 p−
• The entropy function relative to a Boolean classification, as the proportion p+ of positive examples varies between 0 and 1.
[Figure: Entropy(S) versus p+; the curve rises from 0 at p+ = 0 to a maximum of 1 at p+ = 0.5 and falls back to 0 at p+ = 1.]
What is entropy?
• Entropy E(D) = the expected amount of information (in bits) needed to assign a class to a randomly drawn object in D under the optimal, shortest-length code.
Residual information after splitting on attribute A:
Ires(A) = Info_A(D) = Σ_v (|D_v| / |D|) × Info(D_v)
Information Gain:
= the amount of information ruled out by splitting on attribute A:
Gain(A) = Info(D) − Ires(A)
= information in the current set minus the residual information after splitting.
The most ‘informative’ attribute is the one that minimizes Ires, i.e., maximizes the Gain.
Classification – Decision Tree (ID3)
Class_yes = 9, Class_no = 5
age: youth → 2 Yes, 3 No
age: middle_aged → 4 Yes, 0 No
age: senior → 3 Yes, 2 No
Source: Data Mining: Concepts and Techniques / Jiawei Han, Micheline Kamber, Jian Pei. 3rd ed.
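As a worked sketch of the gain computation these counts set up (the helper function is ours; the resulting values, Info(D) ≈ 0.940 and Gain(age) ≈ 0.246, match the cited textbook example):

```python
# A small sketch computing Gain(age) from the counts above; the helper
# name is ours, the class counts come from the slide.
from math import log2

def info(counts):
    """Entropy Info(D) of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

info_D = info([9, 5])                        # Info(D) ≈ 0.940 bits
partitions = [[2, 3], [4, 0], [3, 2]]        # youth, middle_aged, senior
n = sum(sum(p) for p in partitions)          # 14 tuples in total
info_age = sum(sum(p) / n * info(p) for p in partitions)  # Ires(age) ≈ 0.694
print(info_D, info_age, info_D - info_age)   # Gain(age) ≈ 0.246
```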
Classification – Decision Tree (ID3)
Question:
1. Calculate
Gain(income)
Gain (student)
Gain(credit_rating)
2. Identify the highest Gain.
3. Build the decision tree
Classification – Decision Tree (ID3)
TID  Age          Income  Loan_decision
1    Youth        Low     Risky
2    Youth        Low     Risky
3    Middle-aged  High    Safe
4    Middle-aged  Low     Risky
5    Senior       Low     Safe
6    Senior       Medium  Safe
7    Youth        High    Safe
8    Middle-aged  Medium  Safe
9    Youth        Medium  Safe
10   Senior       High    Safe
Class_Safe = 7, Class_Risky = 3
age: youth → 2 Safe, 2 Risky
age: middle_aged → 2 Safe, 1 Risky
age: senior → 3 Safe, 0 Risky
Calculate the entropy of the dataset (D).
Calculate the entropy of age (residual information).
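A worked check, using the entropy and gain formulas above (our computation, values rounded):
Info(D) = −(7/10) log2(7/10) − (3/10) log2(3/10) ≈ 0.881
Info_age(D) = (4/10)·I(2,2) + (3/10)·I(2,1) + (3/10)·I(3,0) = 0.4 × 1 + 0.3 × 0.918 + 0.3 × 0 ≈ 0.675
Gain(age) = 0.881 − 0.675 ≈ 0.206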
When to use Decision Tree Learning?
Appropriate problems for decision tree learning: classification problems.
Characteristics:
instances described by attribute-value pairs
target function has discrete output values
training data may be noisy
training data may contain missing attribute values
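A minimal sketch, assuming scikit-learn; its DecisionTreeClassifier with criterion="entropy" uses information gain, as ID3 does. The data is the loan exercise above, with age as the only attribute:

```python
# A minimal decision-tree sketch, assuming scikit-learn; the data is the
# loan exercise above (age attribute only, discrete target values).
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

X_raw = [["youth"], ["youth"], ["middle-aged"], ["middle-aged"], ["senior"],
         ["senior"], ["youth"], ["middle-aged"], ["youth"], ["senior"]]
y = ["risky", "risky", "safe", "risky", "safe",
     "safe", "safe", "safe", "safe", "safe"]

enc = OrdinalEncoder()                 # encode the categorical attribute as numbers
X = enc.fit_transform(X_raw)
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(tree.predict(enc.transform([["senior"]])))  # -> ['safe']
```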
Strengths
Classification – Support Vector Machine
A separating hyperplane can be written as W · X + b = 0,
where W is the weight vector, W = {w1, w2, w3, ..., wn}; n is the number of attributes; and b is a scalar (bias).
Given a training instance X = (x1, x2), where x1 and x2 are values of attributes A1 and A2 respectively, and letting b be the initial weight w0, the separating hyperplane is w0 + w1x1 + w2x2 = 0.
Any point above or below the hyperplane is represented by:
w0 + w1x1 + w2x2 > 0 (above) and w0 + w1x1 + w2x2 < 0 (below)
Classification – Support Vector Machine
Source: https://www.javatpoint.com/machine-learning-support-vector-machine-algorithm
Classification – Support Vector Machine
The hyperplanes defining the sides of the margin can be written as:
H1: w0 + w1x1 + w2x2 ≥ 1 for yi = +1, and
H2: w0 + w1x1 + w2x2 ≤ −1 for yi = −1.
Any instance that falls on or above H1 belongs to class +1, and any instance that falls on or below H2 belongs to class −1.
Any training instances that fall on the hyperplanes H1 and H2, the sides defining the margin, are called support vectors. They are equally close to the MMH (maximum marginal hyperplane).
The distance from the separating hyperplane to any point on H1 is 1/||W||, where ||W|| is the Euclidean norm of W, i.e. √(W · W) = √(w1² + w2² + ... + wn²).
This is equal to the distance from any point on H2, so the maximal margin is 2/||W||.
Classification – Support Vector Machine
SVM for linearly inseparable instances:
SVM can also be used to classify non-linear problems.
Source: https://towardsdatascience.com/support-vector-machine
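A minimal sketch, assuming scikit-learn; SVC exposes the support vectors that lie on the margin hyperplanes H1 and H2 (the training points below are made up):

```python
# A minimal SVM sketch, assuming scikit-learn; kernel="linear" fits a
# maximum-margin hyperplane, kernel="rbf" would handle non-linear data.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)   # the training instances lying on H1 and H2
print(clf.predict([[4, 4]]))  # which side of the separating hyperplane
```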
Classification – Support Vector Machine
Applications of SVM
Face detection
Text and hypertext categorization
Classification of images
Bioinformatics
Handwriting recognition
…and more
Classification – Naïve Bayes (NB)
Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X)
P(H) is the prior probability of H, for example the probability that any given customer will buy a computer, regardless of age and income.
P(X|H) is the posterior probability of X conditioned on H.
P(X) is the prior probability of X, e.g. the probability that a person is 25 years old with income = 40,000.
P(H), P(X|H) and P(X) can be computed from the dataset.
Classification – Naïve Bayes (NB)
Given a dataset D, containing instances, denoted by X, and class labels (denoted by Ci), Naïve Bayes classifies an instance X to the class having the highest posterior probability conditioned on X. NB predicts the class of X using:
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i, where P(Ci|X) = P(X|Ci) P(Ci) / P(X)
If the attribute value x1 is categorical, then P(x1|Ci) is the number of instances of class Ci in the dataset having the value x1, divided by the number of instances of class Ci in the dataset.
If x1 is continuous-valued, then we assume the attribute has a Gaussian distribution with mean μ and standard deviation σ; therefore,
P(x1|Ci) = g(x1, μ_Ci, σ_Ci) = (1 / (√(2π) σ_Ci)) exp(−(x1 − μ_Ci)² / (2σ_Ci²))
Classification – Naïve Bayes (NB)
P(age=youth|C1) = ?
P(age=youth|C2) = ?
P(income=medium|C1) = ?
P(income=medium|C2) = ?
P(student=yes|C1) = ?
P(student=yes|C2) = ?
P(credit_rating=fair|C1) = ?
P(credit_rating=fair|C2) = ?
Classification – Naïve Bayes (NB)
P(X|C1) = P(age=youth|C1) × P(income=medium|C1) × P(student=yes|C1) × P(credit_rating=fair|C1)
        = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X|C2) = P(age=youth|C2) × P(income=medium|C2) × P(student=yes|C2) × P(credit_rating=fair|C2)
        = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
To find the class Ci, we compute P(X|Ci)P(Ci):
P(X|C1) P(C1) = 0.028
P(X|C2) P(C2) = 0.007
Since P(X|C1)P(C1) > P(X|C2)P(C2), Naïve Bayes assigns X to class C1.
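A small sketch reproducing this computation in plain Python; the conditional probabilities are the slide's values, and the priors 9/14 and 5/14 come from the earlier class counts (Class_yes = 9, Class_no = 5):

```python
# A minimal Naïve Bayes posterior computation using the slide's values;
# priors 9/14 and 5/14 follow the earlier class counts.
cond = {
    "C1": [0.222, 0.444, 0.667, 0.667],  # P(age|C1), P(income|C1), P(student|C1), P(credit|C1)
    "C2": [0.6, 0.4, 0.2, 0.4],
}
prior = {"C1": 9 / 14, "C2": 5 / 14}

scores = {}
for c, probs in cond.items():
    likelihood = 1.0
    for p in probs:              # naive independence: multiply per-attribute terms
        likelihood *= p
    scores[c] = likelihood * prior[c]

print(scores)                        # {'C1': ~0.028, 'C2': ~0.007}
print(max(scores, key=scores.get))   # -> 'C1'
```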
Multilayer NN and single-layered NN
• Single-layered NN: inputs x1, x2, x3 with weights w1, w2, w3 feed a single output unit:
o = σ( Σ_{i=1}^{n} w_i x_i )
• with the sigmoid activation function:
σ(y) = 1 / (1 + e^(−y))
• Multilayer NN: input nodes → hidden nodes → output nodes.
Architecture of Neural net
• A network containing two hidden layers is called a three-layer neural network, and so on.
[Figure: a 2-layer NN and a 3-layer NN.]
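A minimal sketch of the single-layered unit above, o = σ(Σ wi xi); the weights and inputs are invented illustrative values:

```python
# A minimal single-unit forward pass: o = sigmoid(sum(w_i * x_i)).
# Weights and inputs are made-up illustrative values.
import math

def sigmoid(y: float) -> float:
    return 1.0 / (1.0 + math.exp(-y))

x = [1.0, 0.5, -1.5]   # inputs x1, x2, x3
w = [0.4, -0.2, 0.1]   # weights w1, w2, w3

net = sum(wi * xi for wi, xi in zip(w, x))  # weighted sum of inputs
o = sigmoid(net)                            # unit output
print(net, o)
```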