
Chapter 5: Machine Learning
What do you know about machine learning?
AI vs ML vs DL
What is machine learning?

• ML is able to perform a specified task without being directly told how to do it.

• Example:
• Distinguish between spam and valid email messages. Given a set of manually
labeled good and bad email examples, an algorithm can automatically learn a
set of rules that distinguishes them.


Language identification (Amharic, Ge'ez, Tigrigna, Afar, etc.) (How?)

• Arthur Samuel (1959) defined machine learning as “a sub-field of computer science that gives computers
the ability to learn without being explicitly programmed.”
What is machine learning?
• A widely accepted formal definition by Tom Mitchell (1997, professor at Carnegie
Mellon University):
• A computer program is said to learn from experience E with respect to some class of
tasks T and performance measure P, if its performance at the tasks in T, as
measured by P, improves with experience E.

• In short
• A set of computer programs that automatically learn from past
experiences (examples or a training corpus)

• Example: According to this definition, we can reformulate the email problem as the task of
identifying spam messages (task T) using the data of previously labeled email messages
(experience E) through a machine learning algorithm with the goal of improving the
future email spam labeling (performance measure P)
What is machine learning?

⚫ ML aims to select, explore and extract useful knowledge from complex,
often non-linear data, building a computational model capable of
describing unknown patterns or correlations and, in turn, solving
challenging problems.

⚫ This learning process is often carried out through repeated exposure to the
defined problem (training dataset), allowing the model to achieve self-
optimization and continuously enhance its ability to solve new, previously
unseen problems (test dataset).
Applications of Machine learning

 Prediction (weather, medical, agricultural yield)
 Natural language processing
 Image segmentation
 Spam detection
 Multimedia event detection
 Speech recognition
 Machine translation (MT)
 Surveillance and security systems
 Cancer detection/classification
 Sentiment classification
 Character recognition
 Face recognition
 Object detection and recognition
ML vs Traditional Programming

Traditional programming:
  Data + Program → Computation → Results

Machine Learning:
  Data + Results → Computation → Program
Classes of Machine Learning problem

• Supervised Learning

• Unsupervised Learning

• Semi-supervised Learning

• Reinforcement Learning
Classes of machine learning Problem

Supervised Learning
• Learn to predict output when given an input vector
• Training data includes desired outputs
Machine learning structure
Classes of machine learning Problem

Unsupervised Learning
• The aim is to uncover the underlying structures (classes or clusters) in the
data
• Training data does not include desired outputs. This is the new
frontier of machine learning because most big datasets do not come with
labels.

Clustering

Dimensionality reduction
Machine learning structure
• Semi-supervised Learning
• Desired outputs or classes are available for only a part of
the training data.
• This approach is useful when it is impractical or too
expensive to access or measure the target variable for all
participants
Machine learning structure
Reinforcement Learning
• A learning method that interacts with its environment by
producing actions and discovering errors or rewards.
• Through trial and error, it discovers which actions
maximize reward and minimize penalty.
The Learning Problem
• Given <x, f(x)> pairs, infer f

x    f(x)
1    1
2    4
3    9
4    16
5    ?

Given a finite sample, it is often impossible to guess the true function f.
Approach: find some pattern (called a hypothesis) in the training examples, and
assume that the pattern will hold for future examples too.
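As an illustrative sketch (not from the slides), we can infer a hypothesis for the table above by assuming the hypothesis class is low-degree polynomials and fitting it to the four training pairs; NumPy is an assumed dependency:

```python
import numpy as np

# Training pairs <x, f(x)> from the table above.
x = np.array([1, 2, 3, 4])
y = np.array([1, 4, 9, 16])

# Hypothesis class (an assumption): polynomials of degree <= 2.
coeffs = np.polyfit(x, y, deg=2)   # least-squares fit
hypothesis = np.poly1d(coeffs)

# Assume the learned pattern holds for future examples too.
print(round(float(hypothesis(5)), 2))  # -> 25.0, consistent with f(x) = x^2
```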
The machine learning framework

y = f(x)

where x is the feature (input), f is the prediction function, and y is the output
(prediction).
• Training: given a training set of labeled examples {(x1,y1), …, (xN,yN)},
estimate the prediction function f by minimizing the prediction error on
the training set
• Testing: apply f to a never-before-seen test example x and output the
predicted value y = f(x)
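A minimal sketch of this training/testing framework, assuming scikit-learn and its bundled iris dataset (both are assumptions, not part of the slides):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Labeled examples for estimating f; a held-out test set for evaluating it.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

f = KNeighborsClassifier(n_neighbors=3)
f.fit(X_train, y_train)            # training: estimate f from labeled examples

y_pred = f.predict(X_test)         # testing: apply f to never-before-seen x
print(accuracy_score(y_test, y_pred))
```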
Learning—A Two-Step Process

• Model construction:
• A training set is used to create the model.
• The model is represented as classification rules, decision
trees, or mathematical formula

• Model usage:
• the test set is used to see how well it works for classifying
future or unknown objects
Step 1: Model Construction

The training data is fed to a classification algorithm, which produces the
classifier (model):

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no

Learned model (classification rule):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Step 2: Using the Model in Prediction

The classifier (model) is first evaluated on testing data, then applied to unseen
data, e.g. (Jeff, Professor, 4) → Tenured?

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Challenges in Machine Learning

• Efficiency and scalability of machine learning algorithms


• Handling high-dimensionality
• Handling noise, incomplete and imbalanced data
• Pattern evaluation and knowledge integration
• Protection of security, integrity, and privacy in machine
learning
• Data acquisition and representation issues
• Degree of interpretability for predictive power
• Deployment issues
Basic Steps in Machine Learning
1. Data collection
⚫ “training data”, mostly with “labels” provided by a “teacher”;
2. Data preprocessing
⚫ Clean the data to achieve homogeneity
3. Feature engineering
⚫ Select representative features to improve performance
4. Modeling
⚫ choose the class of models that can describe the data
5. Estimation/Selection
⚫ find the model that best explains the data: simple and fits well;
6. Validation
⚫ evaluate the learned model and compare to solutions found using other model
classes;
7. Operation
⚫ Apply the learned model to new “test” data or real-world instances
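These steps map naturally onto a pipeline. The sketch below is one possible rendering, under the assumption that scikit-learn and its bundled breast-cancer dataset are available; the model choices are illustrative:

```python
from sklearn.datasets import load_breast_cancer       # 1. data collection (toy stand-in)
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler      # 2. preprocessing
from sklearn.feature_selection import SelectKBest     # 3. feature engineering
from sklearn.linear_model import LogisticRegression   # 4. modeling

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), SelectKBest(k=10),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)                           # 5. estimation

print(cross_val_score(model, X_train, y_train).mean())  # 6. validation
print(model.score(X_test, y_test))                    # 7. operation on held-out data
```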
ML Background – Terminology

• Let’s review some common ML terms. Data is usually represented with a
feature matrix (Fig. 1).

• Features
  • Attributes used for analysis, i.e. to classify instances
  • Represented by columns in the feature matrix

• Instances / feature vectors
  • Entities with certain attribute values
  • Represented by rows in the feature matrix; each instance has a class label

• Class labels
  • Indicate the category of each instance
  • This example has two classes (C1 and C2)
  • Only used for supervised learning

label  F1   F2   F3  F4  F5
C1     41   1.2  2   1   3.6
C2     63   1.5  4   0   3.5
C1     109  0.4  6   1   2.4
C1     34   0.2  1   0   3.0
C1     33   0.9  6   1   5.3
C2     565  4.3  10  0   3.2
C1     21   4.3  1   0   1.2
C2     35   5.6  2   0   9.1

Fig. 1 Feature matrix
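As a small illustration (not from the slides), the feature matrix of Fig. 1 could be stored as a NumPy array with a separate label vector:

```python
import numpy as np

# Rows = instances, columns = features F1..F5 (Fig. 1).
X = np.array([
    [41,  1.2,  2, 1, 3.6],
    [63,  1.5,  4, 0, 3.5],
    [109, 0.4,  6, 1, 2.4],
    [34,  0.2,  1, 0, 3.0],
    [33,  0.9,  6, 1, 5.3],
    [565, 4.3, 10, 0, 3.2],
    [21,  4.3,  1, 0, 1.2],
    [35,  5.6,  2, 0, 9.1],
])
# One class label per instance; only needed for supervised learning.
y = np.array(["C1", "C2", "C1", "C1", "C1", "C2", "C1", "C2"])

print(X.shape)   # (8, 5): 8 instances, 5 features
```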


Classification
The classification problem aims to predict group membership (i.e., labels or classes) for
a set of observations.

Classification is a two-step process: a model construction (learning) phase and a
model usage (applying) phase. Sometimes model adjustment by validation is considered
a 3rd step.

Formally:
Given a set of data points X = {x1, x2, ..., xn} and a finite set of target classes Y = {y1, y2, ..., ym},
the classification problem is to define a mapping f : X → Y, where each xi is assigned to one class yi.
Classification
These data analysis tasks are classification:
 A bank officer needs an analysis of data to learn which loan applications are
“safe”/“risky”.
 A marketing manager at a hardware store needs to guess whether a customer
with a given profile will buy a computer: “Yes”/“No”.
 A medical researcher wants to analyze breast cancer data to predict which
treatment a patient should receive: “treatment A”/“treatment B”/“treatment C”.

Classification
The predicted classes “safe” or “risky”, “Yes” or “No”, “treatment A”, “treatment B”,
and “treatment C” are categorical: they have discrete values, and ordering among
them has no meaning.
How much a given customer will spend is a numeric prediction, not classification.
Classification
Data classification is a two-step process:
a) Training phase: a classification algorithm builds the classifier by analysing, or “learning from”,
a training set made up of database tuples and their associated class labels.
The individual tuples making up the training set are referred to as training tuples.
Data tuples can also be referred to as samples, instances, examples, data points or objects.
If the class label of each training tuple is given, it is called supervised learning, in that the classifier is told
to which class each training tuple belongs.
Whereas, in unsupervised learning the class labels of the training tuples (samples) are not known.
b) Testing phase: the model is used for classification after the predictive accuracy of the classifier is
estimated.
A test set made up of test tuples, which are independent of the training tuples, is used to estimate the
predictive accuracy of the model.
Prediction Problems: Classification vs. Numeric Prediction
Classification
predicts categorical class labels (discrete or nominal)
classifies data (constructs a model) based on the training set and the values (class
labels) in a classifying attribute and uses it in classifying new data

Typical applications
Credit/loan approval:
Medical diagnosis: if a tumor is cancerous or benign
Fraud detection: if a transaction is fraudulent
Web page categorization: which category it is

Numeric Prediction
models continuous-valued functions, i.e., predicts unknown or missing values

Typical applications
Predicting house prices based on size, location, etc.
Forecasting sales revenue for a company
Estimating a student’s exam score based on study hours
Classification—A Two-Step Process

Model construction: describing a set of predetermined classes


Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules, decision trees, or
mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of test sample is compared with the classified result from the
model
Accuracy rate is the percentage of test set samples that are correctly classified
by the model
Test set is independent of training set (otherwise overfitting)
If the accuracy is acceptable, use the model to classify new data
Note: If the test set is used to select models, it is called validation (test) set
Classification

TID  Age          Income  Loan_decision
1    Youth        Low     Risky
2    Youth        Low     Risky
3    Middle_aged  High    Safe
4    Middle_aged  Low     Risky
5    Senior       Low     Safe
6    Senior       Medium  Safe
7    Middle_aged  High    Safe

Training phase: the labeled tuples above are used to build the classification model.
Testing phase: the model is then applied to held-out tuples to estimate its accuracy.
Common Classification Methods
K-Nearest-Neighbor
Decision Tree
Naïve Bayesian
Support Vector Machine (SVM)
Artificial Neural Network (ANN)
k-Nearest Neighbour (k-NN) Classification
 In k-nearest-neighbor (k-NN) classification, the training
dataset is used directly to classify each member of a target dataset.

 There is no model created during a learning phase other than the
training set itself.

 It is known as a lazy-learning method, unlike eager learners such as
decision trees, Naïve Bayes, etc.

 Rather than building a model and referring to it during
classification, k-NN refers directly to the training set for
classification.
Classification - K nearest neighbor (KNN)
Eager learners: construct a model from a set of training instances before testing a new
instance to classify.
The learned model is eager to classify unseen instances.
Lazy learners: store the entire dataset in memory and wait until a test instance is
given, then construct the model.
They classify the new instance based on its similarity to the stored training instances.
They do less work when a training instance is given, and more when classifying new instances or making
numeric predictions.
Also called instance-based learners, since they store the training instances.
They are computationally expensive during classification or numeric prediction.
They support incremental learning.
KNN and case-based reasoning classifiers are examples of lazy learners.
Classification - K nearest neighbor (KNN)
KNN was introduced in the 1950s.
KNN classifiers are based on learning by analogy, i.e. by comparing a new instance
with training instances that are similar to it.
Training instances are defined by n attributes, which can be represented in an n-
dimensional pattern space.
When a new instance is presented to it, KNN searches the pattern space for the k
training instances that are nearest to the new instance.
The k training instances are the k nearest neighbors of the new instance.
Classification - K nearest neighbor (KNN)
Similarity/closeness is defined in terms of a distance metric.
The nearest neighbours are defined in terms of Euclidean distance:

dist(X1, X2) = sqrt( Σi (x1i − x2i)² )

When there is a large difference in range between attributes (e.g. between income = 42,000
and age = 35), normalization is performed to prevent attributes with large values from
outweighing those with small values. E.g. min-max normalization:

v' = (v − minA) / (maxA − minA)

where minA and maxA are the min and max values of attribute A, and v is the current
value of attribute A.
Classification - K nearest neighbor (KNN)
Calculate the normalized values of the attributes salary and experience:

Name      Salary  Experience  Normalized_Salary  Normalized_Experience
Abebe     10,000  7           0.4                0.4
Birtukan  15,000  10          1                  1
Melese    14,500  9           0.9                0.8
Yonas     10,000  6           0.4                0.2
Almaz     10,000  5           0.4                0
Metadel   7,000   7           0                  0.4
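A short sketch (an illustration, not from the slides) that reproduces the normalized columns with min-max normalization, rounding to one decimal as the table does:

```python
def min_max(values):
    """Min-max normalize a list of numbers to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

salary = [10000, 15000, 14500, 10000, 10000, 7000]
experience = [7, 10, 9, 6, 5, 7]

print([round(v, 1) for v in min_max(salary)])      # [0.4, 1.0, 0.9, 0.4, 0.4, 0.0]
print([round(v, 1) for v in min_max(experience)])  # [0.4, 1.0, 0.8, 0.2, 0.0, 0.4]
```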
Classification - K nearest neighbor (KNN)
In KNN, a new instance is assigned to the majority class among its k nearest neighbors.
When k=1, the new instance is assigned to the class of the training instance that is closest
to it in the pattern space.
KNN can also be used for numeric prediction: the classifier returns the average value of
the real-valued labels associated with the k nearest neighbors of the new instance.
For nominal attributes, a simple method is to compare the corresponding value of the
attribute in tuple X1 with that in tuple X2:
If the values are identical, the difference is zero.
If the values are different, the difference is 1.
Nearest Neighbor Classifiers

 Basic idea: the properties of any particular input X are likely to be similar
to those of points in the neighborhood of X.
◗ E.g., if the object walks like a duck and quacks like a duck, then it’s
probably a duck.

Given a test record: compute its distance to all training records, then choose
the k nearest training records.
Definition of Nearest Neighbor

[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]

Three things are required for KNN:
 The set of stored records
 A distance metric to compute the distance between records
 The value of k, the number of nearest neighbors to retrieve
Nearest Neighbor Classification

 Choosing the value of k:


If k is too small, sensitive to noise points
If k is too large, neighborhood may include points from
other classes
Steps of K-NN Algorithm

 k-nearest neighbors algorithm steps.

Step 1. Determine parameter k= number of nearest neighbors

Step 2. Calculate the distance between the query-instance and all the training samples

Step 3. Sort the distance and determine nearest neighbors based on the k-th minimum distance

Step 4. Gather the category Y of the nearest neighbor

Step 5. Use simple majority of the category of nearest neighbors as the prediction value of the
query instance
k-NN Example
The steps from the previous slide are applied to the following training data:

Acid Durability (X1)  Strength (X2)  Classification (Y)
7                     7              Bad
7                     4              Bad
3                     4              Good
1                     4              Good

Question: A factory produces a new paper tissue that passes the laboratory
test with X1 = 3 and X2 = 7. Classify this paper as good or bad using the
k-nearest neighbor method (k = 3).
k-NN Example
Step 2. Calculate the distance between the query instance and all the training examples:

Acid Durability (X1)  Strength (X2)  Squared distance to query instance (3, 7)
7                     7              (7-3)² + (7-7)² = 16
7                     4              (7-3)² + (4-7)² = 25
3                     4              (3-3)² + (4-7)² = 9
1                     4              (1-3)² + (4-7)² = 13
k-NN Example
Step 3. Sort the distances and determine the nearest neighbors based on the k-th
minimum distance:

Acid Durability (X1)  Strength (X2)  Squared distance to (3, 7)  Rank  In 3-nearest neighbors?
7                     7              16                          3     Yes
7                     4              25                          4     No
3                     4              9                           1     Yes
1                     4              13                          2     Yes

k-NN Example
Step 4. Gather the category Y of the nearest neighbors. Notice that in the second
row the category of the nearest neighbor (Y) is not included, because the rank of
this data point is more than 3 (= k).

Acid Durability (X1)  Strength (X2)  Squared distance to (3, 7)  Rank  In 3-nearest neighbors?  Y = Category
7                     7              16                          3     Yes                      Bad
7                     4              25                          4     No                       -
3                     4              9                           1     Yes                      Good
1                     4              13                          2     Yes                      Good

k-NN Example
 Step 5. Use the simple majority of the categories of the nearest neighbors as
the prediction value of the query instance.

 In this example we have 2 Good and 1 Bad; since 2 > 1, we conclude that a
new paper tissue that passes the lab test with X1 = 3 and X2 = 7 is classified
as GOOD.

* Please try (x1=2, x2=6)
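The five steps can be collected into one small function. The sketch below (an illustration, not from the slides) reproduces the paper-tissue example and can be used for the suggested exercise:

```python
from collections import Counter

def knn_classify(training, query, k=3):
    """Classify `query` by simple majority among its k nearest training instances."""
    # Step 2: squared Euclidean distance from the query to every training sample.
    dists = [((x1 - query[0]) ** 2 + (x2 - query[1]) ** 2, y)
             for (x1, x2, y) in training]
    # Step 3: sort by distance and keep the k nearest neighbors.
    neighbors = sorted(dists, key=lambda d: d[0])[:k]
    # Steps 4-5: gather the categories and take the simple majority.
    votes = Counter(y for (_, y) in neighbors)
    return votes.most_common(1)[0][0]

data = [(7, 7, "Bad"), (7, 4, "Bad"), (3, 4, "Good"), (1, 4, "Good")]

print(knn_classify(data, (3, 7)))  # -> 'Good' (neighbors at distances 9, 13, 16)
print(knn_classify(data, (2, 6)))  # the suggested exercise instance
```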


Nearest Neighbor Classification
Advantages of KNN
It is extremely easy to implement.
It requires no training prior to making real-time predictions, which makes KNN much faster than
algorithms that require training, e.g. SVM or linear regression.
Only two parameters are required to implement KNN: the value of k and the distance function (e.g.
Euclidean or Manhattan).
Disadvantages of KNN
The KNN algorithm doesn't work well with high-dimensional data, because with a large number of
dimensions it becomes difficult for the algorithm to compute meaningful distances.
The KNN algorithm doesn't work well with categorical features, since it is difficult to define the distance
between values of categorical features.
Classification - K nearest neighbor (KNN)
Application areas of KNN
 Text mining
 Agriculture
 Finance
 Medical
 Facial recognition
 Recommendation systems (Amazon, Hulu, Netflix, etc)

Source: https://medium.com/@arman_hussain786/k-nearest-neighbors-knn-and-its-applications
Classification – Decision Tree
A Decision Tree is a flowchart-like tree structure, where:
each internal node (non-leaf node) denotes a test on an attribute,
each branch represents an outcome of the test,
and each leaf node (or terminal node) holds a class label.

[Figure: a decision tree with a root node at the top and leaf nodes at the bottom]
Classification – Decision Tree
How is classification performed using a DT?
Given a tuple, X, for which the associated class label is unknown, the attribute values of the tuple are tested
against the decision tree. A path is traced from the root to a leaf node, which holds the class prediction for
that tuple.
Decision trees can easily be converted to classification rules.

Question: is DT supervised or unsupervised learning?
Classification – Decision Tree
Why DTs are popular classifier methods:
Generating the trees doesn’t require any domain knowledge
They can handle multidimensional data
The learning and classification steps of DT induction are simple and fast
Generally, DTs have good accuracy
Decision tree induction algorithms have been used for classification in many
application areas such as medicine, manufacturing and production, financial
analysis, astronomy, and molecular biology.
Classification – Decision Tree
Attribute selection measures – used to choose the attribute that best dissects the
instances into unique/distinct classes
Information gain
Gain ratio
Gini index
Tree pruning – while building trees, many of the attributes may reflect noise or
outliers in the training set. Therefore, tree pruning is undertaken to remove noise and
outliers, with the goal of improving classification accuracy.
Classification – Decision Tree
Popular decision tree algorithms:
ID3 (Iterative Dichotomiser) by J. Ross Quinlan
C4.5
CART (Classification and Regression Tree)
All three adopt a greedy approach in which DTs are built in a top-down, recursive,
divide-and-conquer manner.
As DTs are built, the training examples are recursively divided into smaller
subsets.
Classification – Decision Tree
Splitting attribute types (illustrated in the original figure):
 Discrete-valued attribute
 Continuous-valued attribute
 Discrete-valued attribute yielding a binary tree
Classification – Decision Tree - attribute selection measures
Also known as splitting rules, since they determine how the instances at a certain
node are to be split.
Given the training instances, the attribute selection measure provides a ranking for
each attribute.
The attribute with the highest score is chosen as the splitting attribute.
The splitting criterion is used to label the tree node created for the partition of the
dataset.
Branches grow according to the outcomes of the splitting criterion, and instances are
partitioned accordingly.
Classification – Decision Tree - attribute selection measures
1) Information gain:
It is based on the theory of information proposed by Claude Shannon.
ID3 uses it as its attribute selection measure.
Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is
chosen as the splitting attribute for node N.
This attribute minimizes the information needed to classify the instances.
Entropy
• D - example set, C1, ..., CN - classes
• Entropy E(D) - a measure of the impurity of the training set D:

Info(D) = E(D) = − Σi pi log2(pi), where pi is the proportion of tuples in D belonging to class Ci

Info(D) is the average amount of information needed to identify the class label
of a tuple in D.
• Entropy in binary classification problems:
E(D) = − p+ log2(p+) − p− log2(p−)
Entropy
• E(D) = − p+ log2(p+) − p− log2(p−)
• The entropy function relative to a Boolean classification, as the proportion p+
of positive examples varies between 0 and 1:

[Figure: Entropy(S) vs. p+; the curve rises from 0 at p+ = 0 to a maximum of 1
at p+ = 0.5 and falls back to 0 at p+ = 1]
What is entropy?
• Entropy E(D) = the expected amount of information (in bits) needed to
assign a class to a randomly drawn object in D under the optimal,
shortest-length code.

• Information theory: an optimal-length code assigns −log2(p) bits to a
message having probability p.

• So, in binary classification problems, the expected number of bits
to encode + or − for a random member of D is:

p+ (−log2(p+)) + p− (−log2(p−)) = − p+ log2(p+) − p− log2(p−)

Information-Theoretic Approach
To classify an object, a certain amount of information is needed: Info(D).
After we have learned the value of attribute A, we only need some remaining amount
of information to classify an object: Ires, the residual information.
The most ‘informative’ attribute is the one that minimizes Ires, i.e., maximizes the Gain:
Gain(A) = Info(D) − Ires(A)
Residual Information

After applying attribute A, D is partitioned into subsets according to the values v of A.
Ires represents the amount of information still needed to classify an instance, and is
equal to the weighted sum of the amounts of information for the subsets:

Ires(A) = InfoA(D) = Σv p(v) × ( − Σc p(c|v) log2 p(c|v) )

where p(v) is the fraction of instances with A = v, and p(c|v) is the probability that an
instance belongs to class c given that it belongs to the subset with A = v.
Information Gain
= the amount of information ruled out by splitting on attribute A:
Gain(A) = Info(D) − Ires(A)
= information in the current set minus the residual information after splitting

The most ‘informative’ attribute is the one that minimizes Ires, i.e., maximizes
the Gain.
Classification – Decision Tree (ID3)
Class_yes = 9, Class_no = 5
age v=youth(Yes) = 2          age v=youth(No) = 3
age v=middle_aged(Yes) = 4    age v=middle_aged(No) = 0
age v=senior(Yes) = 3         age v=senior(No) = 2

Calculate the entropy of the database (D):
Info(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 bits

Calculate the entropy of age (residual information):
Infoage(D) = (5/14) × [−(2/5)log2(2/5) − (3/5)log2(3/5)]
           + (4/14) × [−(4/4)log2(4/4)]
           + (5/14) × [−(3/5)log2(3/5) − (2/5)log2(2/5)] = 0.694 bits

Gain(age) = Info(D) − Infoage(D) = 0.940 − 0.694 = 0.246 bits
Classification – Decision Tree (ID3)

Gain(age) = 0.246 bits


Gain(Income) = 0.029 bits
Gain(student) = 0.151 bits
Gain(credit_rating)= 0.048

Source: Data mining : concepts and techniques / Jiawei Han, Micheline Kamber, Jian Pei. – 3rd ed.
Classification – Decision Tree (ID3)
Question:
1. Calculate
Gain(income)
Gain (student)
Gain(credit_rating)
2. And identify the highest Gain.
3. Build the decision tree
Classification – Decision Tree (ID3)

TID  Age          Income  Loan_decision
1    Youth        Low     Risky
2    Youth        Low     Risky
3    Middle-aged  High    Safe
4    Middle-aged  Low     Risky
5    Senior       Low     Safe
6    Senior       Medium  Safe
7    Youth        High    Safe
8    Middle-aged  Medium  Safe
9    Youth        Medium  Safe
10   Senior       High    Safe

Class_Safe = 7, Class_Risky = 3
age v=youth(Safe) = 2          age v=youth(Risky) = 2
age v=middle_aged(Safe) = 2    age v=middle_aged(Risky) = 1
age v=senior(Safe) = 3         age v=senior(Risky) = 0

Calculate the entropy of the database (D), and the entropy of age (residual
information).
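A sketch (not part of the slides) that computes Info(D) and Gain(age) for the loan table above, following the entropy and residual-information formulas from the earlier slides:

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum(p_i * log2(p_i)) over the class proportions in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

data = [("Youth", "Risky"), ("Youth", "Risky"), ("Middle-aged", "Safe"),
        ("Middle-aged", "Risky"), ("Senior", "Safe"), ("Senior", "Safe"),
        ("Youth", "Safe"), ("Middle-aged", "Safe"), ("Youth", "Safe"),
        ("Senior", "Safe")]
labels = [y for (_, y) in data]

info_d = entropy(labels)  # entropy of the whole database D

# Residual information after splitting on age: weighted entropy of each subset.
i_res = sum((len(sub) / len(data)) * entropy(sub)
            for sub in ([y for (a, y) in data if a == v]
                        for v in ("Youth", "Middle-aged", "Senior")))

print(round(info_d, 3), round(i_res, 3), round(info_d - i_res, 3))
# -> approximately 0.881, 0.675, and Gain(age) = 0.206
```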
When to use Decision Tree Learning?
Appropriate problems for decision tree learning:

Classification problems

Characteristics:
instances described by attribute-value pairs
target function has discrete output values
training data may be noisy
training data may contain missing attribute values
Strengths

can generate understandable rules


perform classification without much computation
can handle continuous and categorical variables
provide a clear indication of which fields are most important for
prediction or classification
Weakness

Not suitable for prediction of continuous attributes.
Perform poorly with many classes and small datasets.
Computationally expensive to train:
At each node, each candidate splitting field must be sorted before its best split can
be found.
In some algorithms, combinations of fields are used and a search must be made for
optimal combining weights.
Pruning algorithms can also be expensive, since many potential sub-trees must be
formed and compared.
Classification – Support Vector Machine
Can classify both linear and nonlinear data.
It is a supervised classification algorithm.
SVMs are highly accurate, even though training time is slow.
SVM is less prone to overfitting than other methods.
Can be used for classification and numeric prediction as well.
Applications: handwritten digit recognition, object recognition, speaker
identification, and time-series prediction.
Classification – Support Vector Machine
SVM for the linearly separable problem:
Let’s consider an example based on two input attributes, A1 and A2.
A straight line can be drawn to separate all the tuples of class +1 from all the tuples of class −1;
there are an infinite number of separating lines that could be drawn.
We want to find the best hyperplane to separate the two classes.
Classification – Support Vector Machine
SVM searches for the maximum marginal hyperplane (MMH).
A hyperplane with a larger margin separates instances more accurately than
one with a smaller margin.
During training, SVM searches for the hyperplane with the largest margin, i.e.
the maximum marginal hyperplane (MMH).
Classification – Support Vector Machine
The separating hyperplane can be represented as:

W · X + b = 0

where W is the weight vector, W = {w1, w2, ..., wn}; n is the number of attributes;
and b is a scalar (bias).
Given a training instance X = (x1, x2), where x1 and x2 are the values of attributes
A1 and A2 respectively, points above and below the separating hyperplane satisfy:

W · X + b > 0 (above),  and  W · X + b < 0 (below)
Classification – Support Vector Machine

Source: https://www.javatpoint.com/machine-learning-support-vector-machine-algorithm
Classification – Support Vector Machine
The hyperplanes defining the sides of the margin can be written as:

H1: W · X + b ≥ 1 for yi = +1,  and
H2: W · X + b ≤ −1 for yi = −1

Any instance that falls on or above H1 belongs to class +1, and any instance that falls
on or below H2 belongs to class −1.
Any training instances that fall on the hyperplanes H1 or H2, the sides defining the
margin, are called support vectors. They are equally close to the MMH.
The distance from the separating hyperplane to any point on H1 is 1/||W||, where
||W|| is the Euclidean norm of W, i.e. sqrt(W · W).
This is equal to the distance from any point on H2, so the margin is 2/||W||.
Classification – Support Vector Machine
SVM for linearly inseparable instances:
SVM can also be used to classify non-linear problems.
 First, transform the original input data into a higher-dimensional space
using a nonlinear mapping.
 Once the data have been transformed into the new, higher-dimensional space,
the second step searches for a linear separating hyperplane in the new space.
Kernel functions are used in non-linear SVM.

Source: https://towardsdatascience.com/support-vector-machine
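A brief scikit-learn sketch (an assumption, not from the slides) contrasting a linear SVM with an RBF-kernel SVM on data that is not linearly separable:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-moons: not linearly separable.
X, y = make_moons(n_samples=300, noise=0.15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)  # kernel = nonlinear mapping

print("linear:", linear_svm.score(X_test, y_test))  # limited by a straight boundary
print("rbf:   ", rbf_svm.score(X_test, y_test))     # typically higher on this data
```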
Classification – Support Vector Machine
Application of SVM
Face detection
Text and hypertext categorization
Classification of images
Bioinformatics
Handwriting recognitions
…and more

Question: Can we use SVM for multiclass classification?
Classification – Naïve Bayes (NB)
It is one of the Bayes classification methods, which are statistical classifiers.
It can predict the class membership probability of a given instance.
Assumption:
Class-conditional independence – the effect of an attribute value on a given class is
independent of the values of the other attributes.
This assumption is why the method is named “naïve”.
Classification – Naïve Bayes (NB)
Let X be an instance and H be the hypothesis that the instance X belongs to a certain class C.
We want to determine the posterior probability, P(H|X), which is the probability that the
hypothesis H holds given the instance X. By Bayes’ theorem:

P(H|X) = P(X|H) P(H) / P(X)

P(H) is the prior probability of H, for example the probability that any given customer will
buy a computer, regardless of age and income.
P(X|H) is the posterior probability of X conditioned on H.
P(X) is the prior probability of X, e.g. the probability that a person is 25 yrs old with income = 40,000.
P(H), P(X|H) and P(X) can be computed from the dataset.
Classification – Naïve Bayes (NB)
Given a dataset D containing instances, denoted by X, and class labels (denoted by Ci),
Naïve Bayes assigns an instance X to the class having the highest posterior probability
conditioned on X. NB predicts the class of X using:

P(Ci|X) = P(X|Ci) P(Ci) / P(X)

X belongs to the class Ci if and only if:

P(Ci|X) > P(Cj|X) for all j ≠ i

P(X) is constant for all classes, hence we only need to maximize P(X|Ci) P(Ci).
Classification – Naïve Bayes (NB)

In a dataset with many attributes, the class-conditional independence assumption is


made, where it presumes the attributes’ values are conditionally independent of one
another (no dependence relationship among the attributes.
Is computed from
Hence, the instances in the
dataset

If the attribute X=x1, is categorical, the P(x1|Ci) is the number of instances of class Ci in
the dataset having the value x1 divided by the total number of instances in the dataset
If x1 is continuous-valued, then we assume the attributes have Gaussian distribution with
mean and standard deviation.

therefore,
Classification – Naïve Bayes (NB)

Let C1 be buys_computer = yes and C2 be buys_computer = no.
Question: given the dataset, to which class does the following instance belong?
X = (age=youth, income=medium, student=yes, credit_rating=fair)
Calculate the following:
P(C1 = yes) = ?
P(C2 = no) = ?
To compute P(X|Ci), first compute the conditional probabilities.
Classification – Naïve Bayes (NB)
Question: given the dataset, to which class does the instance belong?
X = (age=youth, income=medium, student=yes, credit_rating=fair)
Let C1 be buys_computer = yes and C2 be buys_computer = no.
Calculate the following:
P(C1 = yes) = 9/14 = 0.643
P(C2 = no) = 5/14 = 0.357
To compute P(X|Ci), first compute the conditional probabilities:

P(age=youth|C1) = ?
P(age=youth|C2) = ?
P(income=medium|C1) = ?
P(income=medium|C2) = ?
P(student=yes|C1) = ?
P(student=yes|C2) = ?
P(credit_rating=fair|C1) = ?
P(credit_rating=fair|C2) = ?
Classification – Naïve Bayes (NB)
P(X|C1) = P(age=youth|C1) × P(income=medium|C1) × P(student=yes|C1) × P(credit_rating=fair|C1)
        = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X|C2) = P(age=youth|C2) × P(income=medium|C2) × P(student=yes|C2) × P(credit_rating=fair|C2)
        = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
To find the class Ci, we compute P(X|Ci)P(Ci):
P(X|C1) P(C1) = 0.044 × 0.643 = 0.028
P(X|C2) P(C2) = 0.019 × 0.357 = 0.007

Therefore, the classifier predicts X to be in class C1, which is buys_computer = yes.
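A compact sketch (illustrative, not from the slides) that reproduces this calculation from the prior and conditional probabilities above:

```python
# Conditional probabilities taken from the worked example above:
# P(age|Ci), P(income|Ci), P(student|Ci), P(credit_rating|Ci).
cond = {
    "C1": [0.222, 0.444, 0.667, 0.667],
    "C2": [0.6, 0.4, 0.2, 0.4],
}
prior = {"C1": 9 / 14, "C2": 5 / 14}

scores = {}
for c in cond:
    p_x_given_c = 1.0
    for p in cond[c]:
        p_x_given_c *= p               # naive assumption: per-attribute product
    scores[c] = p_x_given_c * prior[c]  # P(X|Ci) * P(Ci)

print(scores)                       # {'C1': ~0.028, 'C2': ~0.007}
print(max(scores, key=scores.get))  # 'C1' -> buys_computer = yes
```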
Classification – Naïve Bayes (NB)
Application of Naïve Bayes Classification algorithm:
Text classification
Spam filtering
Sentiment analysis
Multi-class classification
Recommendation system
Credit scoring
Medical data classification
Real-time predictions
Assignment
Given the table, predict the class for the
following instances
1) X = (age=Senior, income=medium, student =
yes, credit_rating=excellent)
2) Z = (age=middle_aged, income=low, student
= no, credit_rating=fair)
3) R = (age=Youth, income=medium, student =
no, credit_rating=fair)
Neural Network
Also called a connectionist model or parallel distributed processing.
A neural network is an interconnected assembly of simple processing
elements, units or nodes, whose functionality is loosely based on the
animal neuron.
The processing ability of the network is stored in the interunit
connection strengths, or weights, obtained by a process of adaptation
to, or learning from, a set of training patterns.
Learning in Neural Network

A NN is represented as a layered set of interconnected processors.
These processor nodes have a relationship with the neurons of the brain.
A network with only an input and an output layer is called a
single-layer neural network, whereas a multilayer neural
network is a generalization with one or more hidden layers.
A network containing two hidden layers is called a three-layer neural
network, and so on.
Learning in Neural Network
ANN: often called a neural network or simply neural net (NN).
An extremely simplified model of the brain.
It transforms inputs into outputs to the best of its ability.
Composed of many “neurons” that cooperate to perform the desired function.
An artificial representation of the human brain that tries to simulate its
learning process.
Nodes and Layers
• Each node has a weighted connection to several other nodes in adjacent layers.
Individual nodes take the input received from connected nodes and use the
weights together to compute output values.
• The inputs are fed simultaneously into the input layer.
• The weighted outputs of these units are fed into the hidden layer.
• The weighted outputs of the last hidden layer are inputs to units making up
the output layer.
Architecture of Neural Network
Neural networks are used to look for patterns in data, learn these
patterns, and then classify new patterns and make forecasts.
A network with only an input and an output layer is called a single-
layer neural network, whereas a multilayer neural network is a
generalization with one or more hidden layers.

Single-layer NN: a node computes a weighted sum of its inputs and passes it
through an activation function:

o = σ( Σi=1..n wi xi ),  with the sigmoid activation σ(y) = 1 / (1 + e^(−y))

[Figure: a single-layer NN with inputs x1, x2, x3 and weights w1, w2, w3 feeding
one output node; a multilayer NN with input, hidden, and output nodes]
Architecture of Neural Network
• A network containing two hidden layers is called a three-layer
neural network, and so on.

[Figure: a 2-layer NN and a 3-layer NN]