Chapter 5: Machine Learning
Machine Learning
What do you know about machine learning?
AI vs ML vs DL
What is machine learning?
• It means that ML is able to perform a specified task without being directly told how to do it.
• Examples:
• Distinguishing between spam and valid email messages: given a set of manually labeled good and bad email examples, an algorithm can automatically learn a set of rules that distinguish them.
• Language identification (Amharic, Ge'ez, Tigrigna, Afar, etc.) (How?)
• Arthur Samuel (1959) defined machine learning as “a sub-field of computer science that gives computers the ability to learn without being explicitly programmed.”
What is machine learning?
• A widely accepted formal definition by Tom Mitchell (1997, professor at Carnegie Mellon University):
• A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at the tasks in T, as measured by P, improves with experience E.
• In short
• A set of computer programs that automatically learn from past experiences (examples or a training corpus).
• Example: According to this definition, we can reformulate the email problem as the task of identifying spam messages (task T) using the data of previously labeled email messages (experience E) through a machine learning algorithm, with the goal of improving the future email spam labeling (performance measure P).
What is machine learning?
⚫ This learning process is often carried out through repeated exposure to the defined problem (training dataset), allowing the model to self-optimize and continuously improve its ability to solve new, previously unseen problems (test dataset).
Applications of Machine Learning
[Figure: traditional programming vs. machine learning]
Traditional programming: Data + Program → Computation → Results
Machine learning: Data + Results → Computation → Program
Classes of Machine Learning Problems
• Supervised Learning
• Unsupervised Learning
• Semi-supervised Learning
• Reinforcement Learning
Classes of Machine Learning Problems
Supervised Learning
• Learn to predict output when given an input vector
• Training data includes desired outputs
Machine learning structure
Classes of Machine Learning Problems
Unsupervised Learning
• The aim is to uncover the underlying structures (classes or clusters) in the data.
• Training data does not include desired outputs. This is the new frontier of machine learning because most big datasets do not come with labels.
Clustering
Dimensionality reduction
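As a small illustration of the unsupervised setting, the sketch below, assuming scikit-learn is available (the data points are made up), clusters unlabeled instances into two groups:

```python
# A minimal clustering sketch, assuming scikit-learn; the feature
# vectors below are invented for illustration and carry no labels.
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled instances: each row is a feature vector, no class label given.
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment discovered for each instance
print(kmeans.cluster_centers_)  # the uncovered structure: two cluster centers
```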
Machine learning structure
• Semi-supervised Learning
• Desired outputs or classes are available for only a part of the training data.
• This approach is useful when it is impractical or too expensive to access or measure the target variable for all participants.
Machine learning structure
Reinforcement Learning
• A learning method that interacts with its environment by producing actions and discovering errors or rewards.
• On the basis of trial and error, it discovers which actions maximize reward and minimize penalty.
The Learning Problem
• Given <x, f(x)> pairs, infer f
y = f(x)
• Model construction:
• A training set is used to create the model.
• The model is represented as classification rules, decision trees, or mathematical formulae.
• Model usage:
• The test set is used to see how well the model works for classifying future or unknown objects.
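As a minimal sketch of construction vs. usage, assuming scikit-learn (the dataset and classifier choice are illustrative, not prescribed by these slides):

```python
# A minimal sketch of model construction vs. model usage, assuming
# scikit-learn; the dataset and classifier choice are illustrative only.
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Model construction: learn the classifier from the training set.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Model usage: check how well it classifies held-out (unseen) objects.
print(accuracy_score(y_test, model.predict(X_test)))
```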
Step 1: Model Construction
[Figure: training data is fed to classification algorithms to build a classifier (model); the model is then applied to testing data and unseen data.]
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Unseen data: (Jeff, Professor, 4) → Tenured?
Challenges in Machine Learning
• Instances/feature vectors
• Represented by rows in the feature matrix.
• Each instance has a class label.
• Class labels
• Indicate the category of each instance.
• This example has two classes (C1 and C2).
• Only used for supervised learning.
Class label  Instances (feature vectors)
C2           63   1.5  4   0  3.5
C1           109  0.4  6   1  2.4
C1           34   0.2  1   0  3.0
C1           33   0.9  6   1  5.3
C2           565  4.3  10  0  3.2
C1           21   4.3  1   0  1.2
C2           35   5.6  2   0  9.1
Formally:
Given a set of data points X = {x1, x2, ..., xn} and a finite set of target classes Y = {y1, y2, ..., ym}, the classification problem is to define a mapping f : X → Y, where each xi is assigned to one of the classes yi.
Classification
Typical applications
Credit/loan approval: e.g., a bank officer needs to analyze data to learn which loan applications are “safe” and which are “risky”
Medical diagnosis: if a tumor is cancerous or benign
Fraud detection: if a transaction is fraudulent
Web page categorization: which category a page belongs to
Numeric Prediction
models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications
Predicting house prices based on size, location, etc.
Forecasting sales revenue for a company
Estimating a student’s exam score based on study hours
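A minimal numeric-prediction sketch, assuming scikit-learn; the house sizes and prices below are invented for illustration:

```python
# A minimal numeric-prediction sketch, assuming scikit-learn; the
# sizes and prices are made up to illustrate a continuous-valued target.
import numpy as np
from sklearn.linear_model import LinearRegression

sizes = np.array([[50], [80], [120], [200]])  # house size in square meters
prices = np.array([100, 160, 240, 400])       # price in thousands

reg = LinearRegression().fit(sizes, prices)
print(reg.predict([[150]]))                   # predicted price for a 150 m^2 house
```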
Classification—A Two-Step Process
[Figure: the training phase builds the classification model; the testing phase applies it.]
Common Classification Methods
K-Nearest-Neighbor
Decision Tree
Naïve Bayesian
Support Vector Machine (SVM)
Artificial Neural Network (ANN)
k-Nearest Neighbor (k-NN) Classification
In k-nearest-neighbor (k-NN) classification, the training dataset is used to classify each member of a target dataset.
When there is a large range between or among attributes (e.g. between income = 42,000 and age = 35), normalization is performed to prevent large-valued attributes from outweighing small-valued ones, e.g. min-max normalization: v' = (v − min_A) / (max_A − min_A).
[Figure: for a test record, compute the distance to all training samples, sort the distances, determine the nearest neighbors based on the k-th minimum distance, and predict by simple majority of their categories.]
k-NN Example
Training data:
Acid Durability (X1)  Strength (X2)  Classification (Y)
7                     7              Bad
7                     4              Bad
3                     4              Good
1                     4              Good
k-nearest neighbors algorithm steps:
Step 1. Determine parameter k = the number of nearest neighbors.
Step 2. Calculate the distance between the query instance and all the training samples.
Step 3. Sort the distances and determine the nearest neighbors based on the k-th minimum distance.
Step 4. Gather the categories Y of the nearest neighbors.
Step 5. Use the simple majority of the categories of the nearest neighbors as the prediction value of the query instance.
Question: A factory produces a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7. Classify this paper as good or bad using the k-nearest neighbor method.
k-NN Example
Step 2. Calculate the (squared) distance between the query instance (3, 7) and all the training examples:
(7, 7): (7-3)² + (7-7)² = 16
(7, 4): (7-3)² + (4-7)² = 25
(3, 4): (3-3)² + (4-7)² = 9
(1, 4): (1-3)² + (4-7)² = 13
Source: https://medium.com/@arman_hussain786/k-nearest-neighbors-knn-and-its-applications
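A short sketch of the full procedure on this example, using only the Python standard library; k = 3 is our assumption for this exercise:

```python
# A minimal k-NN sketch for the paper-tissue example above; k = 3 is assumed.
from collections import Counter
import math

train = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query = (3, 7)
k = 3

# Step 2: Euclidean distance from the query to every training sample.
dists = [(math.dist(x, query), label) for x, label in train]
# Step 3: sort and keep the k nearest neighbors.
dists.sort(key=lambda d: d[0])
neighbours = dists[:k]
# Steps 4-5: simple majority vote over the neighbors' categories.
prediction = Counter(label for _, label in neighbours).most_common(1)[0][0]
print(neighbours, "->", prediction)  # nearest: (3,4) Good, (1,4) Good, (7,7) Bad -> "Good"
```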
Classification – Decision Tree
A Decision Tree is a flowchart-like tree structure, where:
each internal node (non-leaf node) denotes a test on an attribute,
each branch represents an outcome of the test,
and each leaf node (or terminal node) holds a class label.
[Figure: a decision tree with the root node at the top and leaf nodes at the bottom.]
Classification – Decision Tree
How is classification performed using a DT?
Given a tuple, X, for which the associated class label is unknown, the attribute values of the tuple are tested against the decision tree. A path is traced from the root to a leaf node, which holds the class prediction for that tuple.
Decision trees can easily be converted to classification rules.
[Figure: example splits on a discrete-valued attribute, a continuous-valued attribute, and a discrete-valued attribute yielding a binary tree.]
Classification – Decision Tree – attribute selection measures
Also known as splitting rules, since they determine how the instances at a certain node are to be split.
Given the training instances, the attribute selection measure gives a ranking for each attribute.
The attribute with the highest score is chosen as the splitting attribute.
The splitting criterion is used to label the tree node created for a partition of the dataset.
Branches grow according to the outcome of the splitting criterion, and instances are partitioned accordingly.
Classification – Decision Tree – attribute selection measures
1) Information gain:
It is based on the theory of information proposed by Claude Shannon.
ID3 uses it as its attribute selection measure.
Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N.
This attribute minimizes the information needed to classify the instances.
Entropy
• D – example set, C1, ..., CN – classes
• Entropy E(D) – a measure of the impurity of the training set:
Info(D) = E(D) = − Σ_{i=1}^{N} p_i log2(p_i), where p_i is the proportion of instances in D belonging to class Ci.
Info(D) is the average amount of information needed to identify the class label of a tuple in D.
• Entropy in binary classification problems:
E(D) = − p+ log2 p+ − p− log2 p−
Entropy
• E(D) = − p+ log2 p+ − p− log2 p−
• The entropy function relative to a Boolean classification, as the proportion p+ of positive examples varies between 0 and 1.
[Figure: Entropy(S) versus p+; the curve rises from 0 at p+ = 0 to a maximum of 1 at p+ = 0.5 and falls back to 0 at p+ = 1.]
What is entropy?
• Entropy E(D) = the expected amount of information (in bits) needed to assign a class to a randomly drawn object in D under the optimal, shortest-length code.
Residual information after splitting on attribute A:
Ires(A) = Info_A(D) = Σ_v (|D_v| / |D|) × Info(D_v)
Information Gain:
= the amount of information ruled out by splitting on attribute A:
Gain(A) = Info(D) − Ires(A)
= information in the current set minus the residual information after splitting.
The most ‘informative’ attribute is the one that minimizes Ires, i.e., maximizes the Gain.
Classification – Decision Tree (ID3)
Class_yes = 9, Class_no = 5
age: youth → 2 Yes, 3 No
age: middle_aged → 4 Yes, 0 No
age: senior → 3 Yes, 2 No
Source: Data Mining: Concepts and Techniques / Jiawei Han, Micheline Kamber, Jian Pei. 3rd ed.
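As a worked sketch of the gain computation these counts set up (the helper function is ours; the resulting values, Info(D) ≈ 0.940 and Gain(age) ≈ 0.246, match the cited textbook example):

```python
# A small sketch computing Gain(age) from the counts above; the helper
# name is ours, the class counts come from the slide.
from math import log2

def info(counts):
    """Entropy Info(D) of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

info_D = info([9, 5])                        # Info(D) ≈ 0.940 bits
partitions = [[2, 3], [4, 0], [3, 2]]        # youth, middle_aged, senior
n = sum(sum(p) for p in partitions)          # 14 tuples in total
info_age = sum(sum(p) / n * info(p) for p in partitions)  # Ires(age) ≈ 0.694
print(info_D, info_age, info_D - info_age)   # Gain(age) ≈ 0.246
```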
Classification – Decision Tree (ID3)
Question:
1. Calculate
Gain(income)
Gain (student)
Gain(credit_rating)
2. Identify the highest Gain.
3. Build the decision tree
Classification – Decision Tree (ID3)
TID  Age          Income  Loan_decision
1    Youth        Low     Risky
2    Youth        Low     Risky
3    Middle-aged  High    Safe
4    Middle-aged  Low     Risky
5    Senior       Low     Safe
6    Senior       Medium  Safe
7    Youth        High    Safe
8    Middle-aged  Medium  Safe
9    Youth        Medium  Safe
10   Senior       High    Safe
Class_Safe = 7, Class_Risky = 3
age: youth → 2 Safe, 2 Risky
age: middle_aged → 2 Safe, 1 Risky
age: senior → 3 Safe, 0 Risky
Calculate the entropy of the dataset (D).
Calculate the entropy of age (residual information).
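A worked check, using the entropy and gain formulas above (our computation, values rounded):
Info(D) = −(7/10) log2(7/10) − (3/10) log2(3/10) ≈ 0.881
Info_age(D) = (4/10)·I(2,2) + (3/10)·I(2,1) + (3/10)·I(3,0) = 0.4 × 1 + 0.3 × 0.918 + 0.3 × 0 ≈ 0.675
Gain(age) = 0.881 − 0.675 ≈ 0.206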
When to use Decision Tree Learning?
Appropriate problems for decision tree learning: classification problems.
Characteristics:
instances described by attribute-value pairs
target function has discrete output values
training data may be noisy
training data may contain missing attribute values
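A minimal sketch, assuming scikit-learn; its DecisionTreeClassifier with criterion="entropy" uses information gain, as ID3 does. The data is the loan exercise above, with age as the only attribute:

```python
# A minimal decision-tree sketch, assuming scikit-learn; the data is the
# loan exercise above (age attribute only, discrete target values).
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

X_raw = [["youth"], ["youth"], ["middle-aged"], ["middle-aged"], ["senior"],
         ["senior"], ["youth"], ["middle-aged"], ["youth"], ["senior"]]
y = ["risky", "risky", "safe", "risky", "safe",
     "safe", "safe", "safe", "safe", "safe"]

enc = OrdinalEncoder()                 # encode the categorical attribute as numbers
X = enc.fit_transform(X_raw)
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(tree.predict(enc.transform([["senior"]])))  # -> ['safe']
```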
Strengths
Classification – Support Vector Machine
A separating hyperplane can be written as W · X + b = 0,
where W is the weight vector, W = {w1, w2, w3, ..., wn}; n is the number of attributes; and b is a scalar (bias).
Given a training instance X = (x1, x2), where x1 and x2 are values of attributes A1 and A2 respectively, and letting b be the initial weight w0, the separating hyperplane is w0 + w1x1 + w2x2 = 0.
Any point above or below the hyperplane is represented by:
w0 + w1x1 + w2x2 > 0 (above) and w0 + w1x1 + w2x2 < 0 (below)
Classification – Support Vector Machine
Source: https://www.javatpoint.com/machine-learning-support-vector-machine-algorithm
Classification – Support Vector Machine
The hyperplanes defining the sides of the margin can be written as:
H1: w0 + w1x1 + w2x2 ≥ 1 for yi = +1, and
H2: w0 + w1x1 + w2x2 ≤ −1 for yi = −1.
Any instance that falls on or above H1 belongs to class +1, and any instance that falls on or below H2 belongs to class −1.
Any training instances that fall on the hyperplanes H1 and H2, the sides defining the margin, are called support vectors. They are equally close to the MMH (maximum marginal hyperplane).
The distance from the separating hyperplane to any point on H1 is 1/||W||, where ||W|| is the Euclidean norm of W, i.e. √(W · W) = √(w1² + w2² + ... + wn²).
This is equal to the distance from any point on H2, so the maximal margin is 2/||W||.
Classification – Support Vector Machine
SVM for linearly inseparable instances:
SVM can also be used to classify non-linear problems.
Source: https://towardsdatascience.com/support-vector-machine
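A minimal sketch, assuming scikit-learn; SVC exposes the support vectors that lie on the margin hyperplanes H1 and H2 (the training points below are made up):

```python
# A minimal SVM sketch, assuming scikit-learn; kernel="linear" fits a
# maximum-margin hyperplane, kernel="rbf" would handle non-linear data.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)   # the training instances lying on H1 and H2
print(clf.predict([[4, 4]]))  # which side of the separating hyperplane
```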
Classification – Support Vector Machine
Applications of SVM
Face detection
Text and hypertext categorization
Classification of images
Bioinformatics
Handwriting recognition
…and more
Classification – Naïve Bayes (NB)
Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X)
P(H) is the prior probability of H, for example the probability that any given customer will buy a computer, regardless of age and income.
P(X|H) is the posterior probability of X conditioned on H.
P(X) is the prior probability of X, e.g. the probability that a person is 25 years old with income = 40,000.
P(H), P(X|H) and P(X) can be computed from the dataset.
Classification – Naïve Bayes (NB)
Given a dataset D, containing instances, denoted by X, and class labels (denoted by Ci), Naïve Bayes classifies an instance X to the class having the highest posterior probability conditioned on X. NB predicts the class of X using:
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i, where P(Ci|X) = P(X|Ci) P(Ci) / P(X)
If the attribute value x1 is categorical, then P(x1|Ci) is the number of instances of class Ci in the dataset having the value x1, divided by the number of instances of class Ci in the dataset.
If x1 is continuous-valued, then we assume the attribute has a Gaussian distribution with mean μ and standard deviation σ; therefore,
P(x1|Ci) = g(x1, μ_Ci, σ_Ci) = (1 / (√(2π) σ_Ci)) exp(−(x1 − μ_Ci)² / (2σ_Ci²))
Classification – Naïve Bayes (NB)
P(age=youth|C1) = ?
P(age=youth|C2) = ?
P(income=medium|C1) = ?
P(income=medium|C2) = ?
P(student=yes|C1) = ?
P(student=yes|C2) = ?
P(credit_rating=fair|C1) = ?
P(credit_rating=fair|C2) = ?
Classification – Naïve Bayes (NB)
P(X|C1) = P(age=youth|C1) × P(income=medium|C1) × P(student=yes|C1) × P(credit_rating=fair|C1)
        = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X|C2) = P(age=youth|C2) × P(income=medium|C2) × P(student=yes|C2) × P(credit_rating=fair|C2)
        = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
To find the class Ci, we compute P(X|Ci)P(Ci):
P(X|C1) P(C1) = 0.028
P(X|C2) P(C2) = 0.007
Since P(X|C1)P(C1) > P(X|C2)P(C2), Naïve Bayes assigns X to class C1.
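A small sketch reproducing this computation in plain Python; the conditional probabilities are the slide's values, and the priors 9/14 and 5/14 come from the earlier class counts (Class_yes = 9, Class_no = 5):

```python
# A minimal Naïve Bayes posterior computation using the slide's values;
# priors 9/14 and 5/14 follow the earlier class counts.
cond = {
    "C1": [0.222, 0.444, 0.667, 0.667],  # P(age|C1), P(income|C1), P(student|C1), P(credit|C1)
    "C2": [0.6, 0.4, 0.2, 0.4],
}
prior = {"C1": 9 / 14, "C2": 5 / 14}

scores = {}
for c, probs in cond.items():
    likelihood = 1.0
    for p in probs:              # naive independence: multiply per-attribute terms
        likelihood *= p
    scores[c] = likelihood * prior[c]

print(scores)                        # {'C1': ~0.028, 'C2': ~0.007}
print(max(scores, key=scores.get))   # -> 'C1'
```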
Multilayer NN and single-layered NN
• Single-layered NN: inputs x1, x2, x3 with weights w1, w2, w3 feed a single output unit:
o = σ( Σ_{i=1}^{n} w_i x_i )
• with the sigmoid activation function:
σ(y) = 1 / (1 + e^(−y))
• Multilayer NN: input nodes → hidden nodes → output nodes.
Architecture of Neural net
• A network containing two hidden layers is called a three-layer neural network, and so on.
[Figure: a 2-layer NN and a 3-layer NN.]
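A minimal sketch of the single-layered unit above, o = σ(Σ wi xi); the weights and inputs are invented illustrative values:

```python
# A minimal single-unit forward pass: o = sigmoid(sum(w_i * x_i)).
# Weights and inputs are made-up illustrative values.
import math

def sigmoid(y: float) -> float:
    return 1.0 / (1.0 + math.exp(-y))

x = [1.0, 0.5, -1.5]   # inputs x1, x2, x3
w = [0.4, -0.2, 0.1]   # weights w1, w2, w3

net = sum(wi * xi for wi, xi in zip(w, x))  # weighted sum of inputs
o = sigmoid(net)                            # unit output
print(net, o)
```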