MACHINE LEARNING

The Machine Learning Tutorial covers the fundamentals and more complex machine learning ideas. Students and professionals in the workforce can benefit from our machine-learning tutorial.

Machine learning, a rapidly developing field of technology, allows computers to learn from previous data automatically. For building mathematical models and making predictions based on historical data or information, machine learning employs a variety of algorithms. It is currently used for various tasks, including speech recognition, email filtering, auto-tagging on Facebook, recommender systems, and image recognition.

You will learn about the many different methods of machine learning, including
reinforcement learning, supervised learning, and unsupervised learning, in this machine
learning tutorial. Regression and classification models, clustering techniques, hidden Markov
models, and various sequential models will all be covered.

What is Machine Learning


In the real world, we are surrounded by humans who can learn everything from their
experiences with their learning capability, and we have computers or machines that work on
our instructions. But can a machine also learn from experiences or past data like a human
does? So here comes the role of Machine Learning.

Introduction to Machine Learning

A subset of artificial intelligence known as machine learning focuses primarily on the creation of algorithms that enable a computer to independently learn from data and previous experiences. Arthur Samuel first used the term "machine learning" in 1959. It could be summarized as follows:

Without being explicitly programmed, machine learning enables a machine to automatically
learn from data, improve performance from experiences, and predict things.

Machine learning algorithms create a mathematical model that, without being explicitly programmed, aids in making predictions or decisions with the assistance of sample historical data, known as training data. To develop predictive models, machine learning brings together statistics and computer science. In machine learning, algorithms that learn from historical data are either constructed or reused, and performance rises in proportion to the quantity of information we provide.

A machine can learn if it can gain more data to improve its performance.

How does Machine Learning work


A machine learning system builds prediction models, learns from previous data, and predicts the output of new data whenever it receives it. The more data the system receives, the better the model it can build, and the more accurately it can predict the output.

Let's say we have a complex problem in which we need to make predictions. Instead of writing code, we just need to feed the data to generic algorithms, which build the logic based on the data and predict the output. Machine learning has changed our perspective on such problems.

Features of Machine Learning:


 Machine learning uses data to detect various patterns in a given dataset.
 It can learn from past data and improve automatically.
 It is a data-driven technology.
 Machine learning is very similar to data mining as it also deals with a huge amount of
data.
Need for Machine Learning
The demand for machine learning is steadily rising. Machine learning is required because it can perform tasks that are too complex for a person to implement directly. Humans are constrained in our ability to manually process vast amounts of data; as a result, we require computer systems, which is where machine learning comes in to simplify our lives.

By providing them with a large amount of data and allowing them to automatically explore the data, build models, and predict the required output, we can train machine learning algorithms. A cost function can be used to measure how well a machine learning algorithm performs on the given data. We can save both time and money by using machine learning.
The significance of machine learning can easily be understood from its use cases. Currently, machine learning is used in self-driving cars, cyber-fraud detection, face recognition, friend suggestions on Facebook, and so on. Top companies such as Netflix and Amazon have built machine learning models that use a vast amount of data to analyze users' interests and recommend products accordingly.
Following are some key points which show the importance of Machine Learning:
 Rapid increment in the production of data
 Solving complex problems, which are difficult for a human
 Decision-making in various sectors including finance
 Finding hidden patterns and extracting useful information from data.

Classification of Machine Learning


At a broad level, machine learning can be classified into three types:

1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning

Supervised Learning
In supervised learning, sample-labelled data are provided to the machine learning system for
training, and the system then predicts the output based on the training data.
The system uses labelled data to build a model that understands the datasets and learns about
each one. After the training and processing are done, we test the model with sample data to
see if it can accurately predict the output.
The objective of supervised learning is to map input data to output data. Supervised learning is based on supervision, much as a student learns under the guidance of a teacher. Spam filtering is an example of supervised learning.
Supervised learning can be grouped further into two categories of algorithms (a minimal code sketch follows the list):
 Classification
 Regression
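As a minimal illustration (a sketch, assuming scikit-learn is installed and using its built-in Iris dataset, which is not part of this tutorial's running example), a supervised classifier can be trained and tested in a few lines:

# A minimal supervised classification sketch using scikit-learn's Iris dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

x, y = load_iris(return_X_y=True)   # labelled data: features x and known outputs y
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=200)   # a common classification algorithm
model.fit(x_train, y_train)                # learn the mapping from inputs to outputs
print(model.score(x_test, y_test))         # accuracy on unseen test data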
Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any
supervision.
The training is provided to the machine with a set of data that has not been labeled, classified,
or categorized, and the algorithm needs to act on that data without any supervision. The goal
of unsupervised learning is to restructure the input data into new features or a group of
objects with similar patterns.
In unsupervised learning, we don't have a predetermined result. The machine tries to find useful insights from a huge amount of data. It can be further classified into two categories of algorithms (a small clustering sketch follows the list):
 Clustering
 Association
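For illustration, here is a minimal unsupervised clustering sketch (again assuming scikit-learn; note that the algorithm receives no labels):

# A minimal unsupervised clustering sketch: grouping unlabelled data with k-means
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

x, _ = load_iris(return_X_y=True)          # keep only the features; labels are discarded
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(x)             # group samples with similar patterns
print(labels[:10])                         # cluster assignments for the first samples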

Reinforcement Learning

Reinforcement learning is a feedback-based learning method, in which a learning agent gets a reward for each right action and a penalty for each wrong action. The agent learns automatically from this feedback and improves its performance. In reinforcement learning, the agent interacts with the environment and explores it. The goal of the agent is to accumulate the most reward points, and hence it improves its performance.

A robotic dog that automatically learns the movement of its limbs is an example of reinforcement learning. A toy sketch of the underlying idea follows.
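The following is a toy Q-learning sketch of this reward-and-penalty loop; the one-dimensional environment and the reward values are illustrative assumptions, not a standard API:

# Toy Q-learning: an agent on a 1-D track learns to walk right toward a goal state
import random

n_states, goal = 5, 4
q = [[0.0, 0.0] for _ in range(n_states)]   # Q-values for two actions: 0=left, 1=right
alpha, gamma, epsilon = 0.5, 0.9, 0.1       # learning rate, discount, exploration rate

for episode in range(200):
    state = 0
    while state != goal:
        # explore occasionally; otherwise act greedily on current Q-values
        if random.random() < epsilon:
            action = random.randint(0, 1)
        else:
            action = q[state].index(max(q[state]))
        next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
        reward = 1.0 if next_state == goal else -0.1   # reward the goal, penalize wandering
        # Q-learning update rule
        q[state][action] += alpha * (reward + gamma * max(q[next_state]) - q[state][action])
        state = next_state

print(q)   # after training, the "right" action has the higher value in every state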

History of Machine Learning


About 40 to 50 years ago, machine learning was science fiction, but today it is part of our daily life. Machine learning is making our day-to-day life easier, from self-driving cars to Amazon's virtual assistant "Alexa". However, the idea behind machine learning is quite old and has a long history. Below are some milestones in the history of machine learning:

The early history of Machine Learning (Pre-1940):


 1834: Charles Babbage, the father of the computer, conceived a device that could be programmed with punch cards. Although the machine was never built, all modern computers rely on its logical structure.
 1936: Alan Turing published a theory of how a machine can determine and execute a set of instructions.
The era of stored-program computers:
 1943: A neural network was first modeled with an electrical circuit, by Warren McCulloch and Walter Pitts. Around 1950, scientists began applying this idea and analyzing how human neurons might work.
 1945: "ENIAC", the first electronic general-purpose computer, was completed; it was programmed manually rather than from stored programs. Stored-program computers such as EDSAC in 1949 and EDVAC in 1951 followed.
Computing machinery and intelligence:
 1950: Alan Turing published a seminal paper, "Computing Machinery and Intelligence," on the topic of artificial intelligence. In the paper, he asked, "Can machines think?"
Machine intelligence in Games:

 1952: Arthur Samuel, a pioneer of machine learning, created a program that helped an IBM computer play checkers. The program performed better the more it played.
 1959: In 1959, the term "Machine Learning" was first coined by Arthur Samuel.
The first "AI" winter:
 The period from 1974 to 1980 was a tough time for AI and ML researchers; this period is known as the first AI winter.
 During this period, machine translation failed to deliver on its promises, public interest in AI declined, and government funding for research was cut.
Machine Learning from theory to reality
 1959: In 1959, the first neural network was applied to a real-world problem to remove
echoes over phone lines using an adaptive filter.
 1985: Terry Sejnowski and Charles Rosenberg invented NETtalk, a neural network that was able to teach itself how to correctly pronounce 20,000 words in one week.
 1997: IBM's Deep Blue intelligent computer won a chess match against the world chess champion Garry Kasparov, becoming the first computer to beat a reigning human chess champion.
Machine Learning in the 21st Century
2006:
 Geoffrey Hinton and his group presented the idea of deep learning using deep belief networks.
 The Elastic Compute Cloud (EC2) was launched by Amazon to provide scalable
computing resources that made it easier to create and implement machine learning
models.
2007:
 The Netflix Prize competition began, tasking participants with improving the accuracy of Netflix's recommendation algorithm.
 Reinforcement learning made notable progress when a group of researchers used it to train a computer to play backgammon at a high level.
2008:
 Google released the Google Prediction API, a cloud-based service that allowed developers to integrate machine learning into their applications.
 Restricted Boltzmann Machines (RBMs), a kind of generative neural network, gained attention for their ability to model complex data distributions.
2009:
 Deep learning gained ground as researchers demonstrated its effectiveness in various tasks, including speech recognition and image classification.
 The term "Big Data" gained popularity, highlighting the challenges and opportunities of handling huge datasets.

2010:
 The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was introduced, driving advances in computer vision and prompting the development of deep convolutional neural networks (CNNs).
2011:
 IBM's Watson defeated human champions on Jeopardy!, demonstrating the potential of question-answering systems and natural language processing.
2012:
 AlexNet, a deep CNN created by Alex Krizhevsky, won the ILSVRC, dramatically improving image classification accuracy and establishing deep learning as a dominant approach in computer vision.
 Google's Brain project, led by Andrew Ng and Jeff Dean, used deep learning to train a neural network that recognized cats in unlabeled YouTube videos.
2013:
 Ian Goodfellow introduced generative adversarial networks (GANs), published in 2014, which made it possible to create realistic synthetic data.
 Google went on to acquire the startup DeepMind Technologies, which focused on deep learning and artificial intelligence, in early 2014.
2014:
 Facebook presented the DeepFace system, which achieved near-human accuracy in facial recognition.
 AlphaGo, a program created by DeepMind at Google, went on to defeat a world-champion Go player (Lee Sedol, in 2016), demonstrating the potential of reinforcement learning in challenging games.
2015:
 Microsoft released the Cognitive Toolkit (formerly known as CNTK), an open-source deep learning library.
 The introduction of attention mechanisms enhanced the performance of sequence-to-sequence models in tasks like machine translation.
2016:
 The goal of explainable AI, which focuses on making machine learning models easier to understand, received growing attention.
 Google's DeepMind created AlphaGo Zero, which achieved superhuman Go play without human game data, using only reinforcement learning.

2017:
 Transfer learning gained prominence, allowing pre-trained models to be reused for different tasks with limited data.
 Better synthesis and generation of complex data were made possible by the introduction of generative models like variational autoencoders (VAEs) and Wasserstein GANs.
These are only some of the notable advancements and milestones in machine learning during this period. The field has continued to advance rapidly beyond 2017, with new breakthroughs, techniques, and applications emerging.
Machine Learning at present:
The field of machine learning has made significant strides in recent years, and its applications are numerous, including self-driving cars, Amazon Alexa, chatbots, and recommender systems. It incorporates clustering, classification, decision trees, SVM algorithms, and reinforcement learning, as well as unsupervised and supervised learning.
Present-day machine learning models can be used to make various predictions, including weather forecasting, disease prediction, stock market analysis, and so on.

Applications of Machine learning
Machine learning is a buzzword in today's technology, and it is growing very rapidly day by day. We use machine learning in our daily life even without knowing it, through Google Maps, Google Assistant, Alexa, etc. Below are some of the most trending real-world applications of machine learning:

1. Image Recognition:

Image recognition is one of the most common applications of machine learning. It is used to identify objects, persons, places, digital images, etc. A popular use case of image recognition and face detection is automatic friend tagging suggestions:

Facebook provides us with a feature of auto friend tagging suggestions. Whenever we upload a photo with our Facebook friends, we automatically get a tagging suggestion with names, and the technology behind this is machine learning's face detection and recognition algorithm.

It is based on the Facebook project named "DeepFace," which is responsible for face recognition and person identification in the picture.

2. Speech Recognition

While using Google, we get an option of "Search by voice"; it comes under speech recognition, and it's a popular application of machine learning.

Speech recognition is a process of converting voice instructions into text, and it is also known as "Speech to text" or "Computer speech recognition." At present, machine learning algorithms are widely used in various speech recognition applications. Google Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow voice instructions.

3. Traffic prediction:

If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions.

It predicts the traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested, with the help of two sources of information:

o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time of day

Everyone who uses Google Maps is helping to make the app better. It takes information from the user and sends it back to its database to improve performance.

4. Product recommendations:

Machine learning is widely used by various e-commerce and entertainment companies, such as Amazon, Netflix, etc., for product recommendations to the user. Whenever we search for some product on Amazon, we start getting advertisements for the same product while surfing the internet in the same browser, and this is because of machine learning.

Google understands the user's interest using various machine learning algorithms and suggests products as per the customer's interest.

Similarly, when we use Netflix, we find recommendations for entertainment series, movies, etc., and this is also done with the help of machine learning.

5. Self-driving cars:

One of the most exciting applications of machine learning is self-driving cars. Machine learning plays a significant role in self-driving cars. Tesla, a well-known car manufacturer, is working on self-driving cars. It uses machine learning methods to train the car models to detect people and objects while driving.

6. Email Spam and Malware Filtering:

Whenever we receive a new email, it is automatically filtered as important, normal, or spam. We always receive important mail in our inbox, marked with the important symbol, and spam emails in our spam box, and the technology behind this is machine learning. Below are some spam filters used by Gmail:

o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters

Some machine learning algorithms, such as the Multi-Layer Perceptron, Decision tree, and Naïve Bayes classifier, are used for email spam filtering and malware detection. A tiny content-filter sketch follows.
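As a hedged illustration of the content-filter idea, here is a minimal Naïve Bayes spam classifier sketch using scikit-learn; the example messages and labels are invented for illustration, not taken from Gmail:

# A tiny content-based spam filter sketch using a Naive Bayes classifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ["win a free prize now", "meeting at 10 am tomorrow",
            "free money claim now", "project report attached"]
labels = [1, 0, 1, 0]                        # 1 = spam, 0 = normal mail

vectorizer = CountVectorizer()               # turn each message into word counts
x = vectorizer.fit_transform(messages)

classifier = MultinomialNB().fit(x, labels)  # learn word patterns per class
test = vectorizer.transform(["claim your free prize"])
print(classifier.predict(test))              # expected output: [1] (spam)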

7. Virtual Personal Assistant:

We have various virtual personal assistants, such as Google Assistant, Alexa, Cortana, and Siri. As the name suggests, they help us find information using our voice instructions. These assistants can help us in various ways just by our voice instructions, such as playing music, calling someone, opening an email, scheduling an appointment, etc.

These virtual assistants use machine learning algorithms as an important part.

These assistants record our voice instructions, send them over the server to the cloud, decode them using ML algorithms, and act accordingly.

8. Online Fraud Detection:

Machine learning is making our online transactions safe and secure by detecting fraudulent transactions. Whenever we perform an online transaction, there are various ways that a fraudulent transaction can take place, such as fake accounts, fake IDs, and money being stolen in the middle of a transaction. To detect this, a feed-forward neural network helps us by checking whether a transaction is genuine or fraudulent.

For each genuine transaction, the output is converted into some hash values, and these values become the input for the next round. Each genuine transaction follows a specific pattern, which changes for a fraudulent transaction; hence the network detects it and makes our online transactions more secure.

9. Stock Market trading:

Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups and downs in shares, so machine learning's long short-term memory (LSTM) neural network is used for the prediction of stock market trends.

10. Medical Diagnosis:

In medical science, machine learning is used for disease diagnosis. With it, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain.

It helps in finding brain tumours and other brain-related diseases easily.

11. Automatic Language Translation:

Nowadays, if we visit a new place and are not familiar with the language, it is not a problem at all, because machine learning helps us by converting the text into languages we know. Google's GNMT (Google Neural Machine Translation) provides this feature; it is a neural machine translation system that translates text into our familiar language, and this is called automatic translation.

The technology behind automatic translation is a sequence-to-sequence learning algorithm, which is also used with image recognition to translate text from one language to another.

Machine learning Life cycle


Machine learning has given computer systems the ability to learn automatically without being explicitly programmed. But how does a machine learning system work? It can be described using the machine learning life cycle. The machine learning life cycle is a cyclic process for building an efficient machine learning project. The main purpose of the life cycle is to find a solution to the problem or project.

Machine learning life cycle involves seven major steps, which are given below:

o Gathering Data
o Data preparation
o Data Wrangling
o Analyse Data
o Train the model
o Test the model
o Deployment

The most important thing in the complete process is to understand the problem and to know the purpose of the problem. Therefore, before starting the life cycle, we need to understand the problem, because a good result depends on a good understanding of the problem.
In the complete life cycle process, to solve a problem, we create a machine learning system called a "model", and this model is created by providing "training". But to train a model we need data; hence, the life cycle starts with collecting data.
1. Gathering Data:
Data gathering is the first step of the machine learning life cycle. The goal of this step is to identify and obtain all the data related to the problem.
In this step, we need to identify the different data sources, as data can be collected from various sources such as files, databases, the internet, or mobile devices. It is one of the most important steps of the life cycle. The quantity and quality of the collected data will determine the efficiency of the output. The more data there is, the more accurate the prediction will be.
This step includes the below tasks:
 Identify various data sources
 Collect data
 Integrate the data obtained from different sources
By performing the above tasks, we get a coherent set of data, also called a dataset. It will be used in further steps.

2. Data Preparation
After collecting the data, we need to prepare it for further steps. Data preparation is a step where we put our data into a suitable place and prepare it for use in machine learning training.
In this step, first, we put all data together, and then randomize the ordering of data.
This step can be further divided into two processes:
Data exploration:
It is used to understand the nature of the data that we have to work with. We need to
understand the characteristics, format, and quality of data.
A better understanding of data leads to an effective outcome. In this, we find Correlations,
general trends, and outliers.
Data pre-processing:
Now the next step is preprocessing of data for its analysis.
3. Data Wrangling
Data wrangling is the process of cleaning and converting raw data into a usable format. It is
the process of cleaning the data, selecting the variable to use, and transforming the data in a
proper format to make it more suitable for analysis in the next step. It is one of the most
important steps of the complete process. Cleaning of data is required to address the quality
issues.
The data we have collected may not all be usable, as some of it may not be relevant. In real-world applications, collected data may have various issues, including:
 Missing Values
 Duplicate data
 Invalid data
 Noise
So, we use various filtering techniques to clean the data, as sketched below.
It is mandatory to detect and remove the above issues because they can negatively affect the quality of the outcome.
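A small pandas sketch of this cleaning step (the column names and values are illustrative assumptions, not this tutorial's dataset):

# A minimal data-wrangling sketch: handling duplicates, invalid and missing values
import pandas as pd

df = pd.DataFrame({
    "Age": [38, 43, 43, None, 30],
    "Salary": [48000, 45000, 45000, 54000, -1],
})
df = df.drop_duplicates()                       # remove duplicate rows
df = df[df["Salary"] > 0]                       # drop rows with invalid salary values
df["Age"] = df["Age"].fillna(df["Age"].mean())  # fill missing ages with the mean
print(df)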
4. Data Analysis
Now the cleaned and prepared data is passed on to the analysis step. This step involves:
 Selection of analytical techniques
 Building models
 Review the result

The aim of this step is to build a machine learning model to analyze the data using various analytical techniques and review the outcome. It starts with determining the type of problem, where we select machine learning techniques such as classification, regression, cluster analysis, association, etc.; then we build the model using the prepared data and evaluate it.
Hence, in this step, we take the data and use machine learning algorithms to build the model.
5. Train Model
Now the next step is to train the model. In this step, we train our model to improve its performance and obtain a better outcome for the problem.
We use datasets to train the model using various machine learning algorithms. Training a model is required so that it can learn the various patterns, rules, and features.
6. Test Model
Once our machine learning model has been trained on a given dataset, we test the model. In this step, we check the accuracy of our model by providing a test dataset to it.
Testing the model determines the percentage accuracy of the model as per the requirements of the project or problem.
7. Deployment
The last step of machine learning life cycle is deployment, where we deploy the model in the
real-world system.
If the above-prepared model is producing an accurate result as per our requirement with
acceptable speed, then we deploy the model in the real system. But before deploying the
project, we will check whether it is improving its performance using available data or not.
The deployment phase is like making the final report for a project.

Difference between Artificial intelligence and Machine learning
Artificial intelligence and machine learning are parts of computer science that are correlated with each other. These two technologies are the most trending technologies used for creating intelligent systems.

Although these are two related technologies, and people sometimes use them as synonyms for each other, they are still two different terms in various contexts.

On a broad level, we can differentiate both AI and ML as:

“AI is a bigger concept to create intelligent machines that can simulate human thinking
capability and behaviour, whereas, machine learning is an application or subset of AI that
allows machines to learn from data without being programmed explicitly.”

Below are some main differences between AI and machine learning along with the overview
of Artificial intelligence and machine learning.

Artificial Intelligence

Artificial intelligence is a field of computer science which makes a computer system that can
mimic human intelligence. It is comprised of two words "Artificial" and "intelligence", which
means "a human-made thinking power." Hence, we can define it as,

“Artificial intelligence is a technology using which we can create intelligent systems that can
simulate human intelligence.”

An artificial intelligence system does not need to be pre-programmed; instead, it uses algorithms that can work with their own intelligence. It involves machine learning algorithms such as reinforcement learning and deep learning neural networks. AI is used in many places, such as Siri, Google's AlphaGo, AI in chess playing, etc.

Based on capabilities, AI can be classified into three types:

 Weak AI
 General AI
 Strong AI

Currently, we are working with weak AI; general AI and strong AI remain future goals, and strong AI is envisioned as being more intelligent than humans.

Machine learning

Machine learning is about extracting knowledge from the data. It can be defined as,

Machine learning is a subfield of artificial intelligence, which enables machines to learn from
past data or experiences without being explicitly programmed.

Machine learning enables a computer system to make predictions or take some decisions using historical data without being explicitly programmed. Machine learning uses a massive amount of structured and semi-structured data so that a machine learning model can generate accurate results or give predictions based on that data.

Machine learning works on algorithms that learn on their own using historical data. It works only for specific domains: if we are creating a machine learning model to detect pictures of dogs, it will only give results for dog images; if we provide new data, such as a cat image, the model will fail to handle it. Machine learning is being used in various places, such as online recommender systems, Google search algorithms, email spam filters, Facebook's auto friend tagging suggestions, etc.

It can be divided into three types:

 Supervised learning
 Reinforcement learning
 Unsupervised learning

Key differences between Artificial Intelligence (AI) and Machine learning (ML):

AI: Artificial intelligence is a technology which enables a machine to simulate human behavior.
ML: Machine learning is a subset of AI which allows a machine to automatically learn from past data without being programmed explicitly.

AI: The goal of AI is to make a smart computer system, like humans, to solve complex problems.
ML: The goal of ML is to allow machines to learn from data so that they can give accurate output.

AI: In AI, we make intelligent systems to perform any task like a human.
ML: In ML, we teach machines with data to perform a particular task and give an accurate result.

AI: Machine learning and deep learning are the two main subsets of AI.
ML: Deep learning is a main subset of machine learning.

AI: AI has a very wide range of scope.
ML: Machine learning has a limited scope.

AI: AI is working to create an intelligent system which can perform various complex tasks.
ML: Machine learning is working to create machines that can perform only those specific tasks for which they are trained.

AI: An AI system is concerned with maximizing the chances of success.
ML: Machine learning is mainly concerned with accuracy and patterns.

AI: The main applications of AI are Siri, customer support using chatbots, expert systems, online game playing, intelligent humanoid robots, etc.
ML: The main applications of machine learning are online recommender systems, Google search algorithms, Facebook auto friend tagging suggestions, etc.

AI: On the basis of capabilities, AI can be divided into three types: Weak AI, General AI, and Strong AI.
ML: Machine learning can also be divided into mainly three types: supervised learning, unsupervised learning, and reinforcement learning.

AI: It includes learning, reasoning, and self-correction.
ML: It includes learning and self-correction when introduced with new data.

AI: AI deals with structured, semi-structured, and unstructured data.
ML: Machine learning deals with structured and semi-structured data.

How to get datasets for Machine Learning
The field of machine learning depends heavily on datasets for training models and making accurate predictions. Datasets play a vital role in the success of ML projects and are essential for becoming a skilled data scientist. In this article, we will explore the various types of datasets used in machine learning and provide a detailed guide on where to find them.

What is a dataset?
A dataset is a collection of data in which the data is arranged in some order. A dataset can contain anything from an array to a database table. The table below shows an example of a dataset:

Country Age Salary Purchased

India 38 48000 No

France 43 45000 Yes

Germany 30 54000 No

France 48 65000 No

Germany 40 Yes

India 35 58000 Yes

A tabular dataset can be understood as a database table or matrix, where each column corresponds to a particular variable, and each row corresponds to a record of the dataset. The most widely supported file type for a tabular dataset is the "Comma-Separated Values" file, or CSV. But to store "tree-like" data, we can use a JSON file more efficiently. A small sketch of building and saving such a dataset follows.

Types of data in datasets


 Numerical data: Such as house price, temperature, etc.
 Categorical data: Such as Yes/No, True/False, Blue/green, etc.
 Ordinal data: These data are similar to categorical data but can be measured on the
basis of comparison.
Note: A real-world dataset is of huge size, which is difficult to manage and process at the
initial level. Therefore, to practice machine learning algorithms, we can use any dummy
dataset.

Types of datasets

Machine learning spans different domains, each requiring specific sorts of datasets. A few common types of datasets used in machine learning include:

Image Datasets:

Image datasets contain a collection of images and are commonly used in computer vision tasks such as image classification, object detection, and image segmentation.

Examples :

o ImageNet
o CIFAR-10
o MNIST

Text Datasets:

Text datasets consist of textual information, like articles, books, or social media posts. These datasets are used in NLP tasks like sentiment analysis, text classification, and machine translation.

Examples :

o Project Gutenberg dataset
o IMDb film reviews dataset

Time Series Datasets:

Time series datasets consist of data points collected over time. They are commonly used in forecasting, anomaly detection, and trend analysis.

Examples :

o Stock market data
o Weather data
o Sensor readings

Tabular Datasets:

Tabular datasets are structured data organized in tables or spreadsheets. They contain rows representing instances or samples and columns representing features or attributes. Tabular datasets are used for tasks like regression and classification. The dataset given earlier in the article is an example of a tabular dataset.

Need of Dataset

o Properly prepared and pre-processed datasets are crucial for machine learning projects.
o They provide the foundation for training accurate and reliable models. However, working with large datasets can present challenges in management and processing.
o To address these challenges, efficient data management techniques and processing algorithms are required.

Data Pre-processing:

Data pre-processing is a fundamental stage in preparing datasets for machine learning. It involves transforming raw data into a format suitable for model training. Common pre-processing techniques include data cleaning to remove inconsistencies or errors, normalization to scale data within a particular range, feature scaling to ensure features have comparable ranges, and handling missing values through imputation or removal.

During the development of the ML project, the developers completely rely on the datasets. In
building ML applications, datasets are divided into two parts:

o Training dataset
o Test dataset

Note: The datasets are of large size, so to download these datasets, you must have fast
internet on your computer.

Training Dataset and Test Dataset:

In machine learning, datasets are ordinarily partitioned into two parts: the training dataset and the test dataset. The training dataset is used to train the machine learning model, while the test dataset is used to evaluate the model's performance. This division assesses the model's ability to generalize to unseen data. It is essential to ensure that the datasets are representative of the problem space and appropriately split to avoid bias or overfitting. A minimal split sketch follows.
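For illustration, here is a minimal 80/20 split with scikit-learn's train_test_split (the samples and labels are invented dummies for the sketch):

# A minimal train/test split sketch: 80% training data, 20% test data
from sklearn.model_selection import train_test_split

x = [[i] for i in range(10)]   # 10 dummy samples with one feature each
y = [0, 1] * 5                 # dummy labels
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
print(len(x_train), len(x_test))   # prints: 8 2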

Popular sources for Machine Learning datasets

Below is a list of dataset sources that are freely available for the public to work on:

1. Kaggle Datasets

Kaggle is one of the best sources for providing datasets for Data Scientists and Machine
Learners. It allows users to find, download, and publish datasets in an easy way. It also
provides the opportunity to work with other machine learning engineers and solve difficult
Data Science related tasks.

Kaggle provides a high-quality dataset in different formats that we can easily find and
download.

The link for the Kaggle dataset is https://www.kaggle.com/datasets.

2. UCI Machine Learning Repository

The UCI Machine Learning Repository is an important resource that has been widely used by researchers and practitioners since 1987. It contains a large collection of datasets sorted by machine learning tasks such as regression, classification, and clustering. Notable datasets in the repository include the Iris dataset, Car Evaluation dataset, and Poker Hand dataset.

The link for the UCI machine learning repository is https://archive.ics.uci.edu/ml/index.php.

3. Datasets via AWS

We can search, download, access, and share the datasets that are publicly available via AWS resources. These datasets are accessed through AWS resources but are provided and maintained by different government organizations, researchers, businesses, or individuals. Anyone can analyze and build various services using shared data via AWS resources. Shared datasets on the cloud help users spend more time on data analysis rather than on data acquisition.

This source provides the various types of datasets with examples and ways to use the dataset.
It also provides the search box using which we can search for the required dataset. Anyone
can add any dataset or example to the Registry of Open Data on AWS.

4. Google's Dataset Search Engine

Google's Dataset Search helps researchers find and access relevant datasets from different sources across the web. It indexes datasets from areas like social sciences, biology, and environmental science. Researchers can use keywords to find datasets, filter results based on specific criteria, and access the datasets directly from the source.

5. Microsoft Datasets

Microsoft has launched the "Microsoft Research Open Data" repository with a collection of free datasets in various areas such as natural language processing, computer vision, and domain-specific sciences. It gives access to diverse and curated datasets that can be valuable for machine learning projects.

6. Awesome Public Dataset Collection

The Awesome Public Datasets collection provides high-quality datasets that are arranged in a well-organized manner within a list, according to topics such as Agriculture, Biology, Climate, Complex networks, etc. Most of the datasets are available free of cost, but some may not be, so it is better to check the license before downloading a dataset.

The link to download the dataset from Awesome public dataset collection
is https://github.com/awesomedata/awesome-public-datasets.

7. Government Datasets

There are different sources to get government-related data. Various countries publish
government data for public use collected by them from different departments.

The goal of providing these datasets is to increase transparency of government work among
the people and to use the data in an innovative approach. Below are some links of
government datasets:

o Indian Government dataset


o US Government Dataset
o Northern Ireland Public Sector Datasets
o European Union Open Data Portal

8. Computer Vision Datasets

VisualData provides a large number of great datasets that are specific to computer vision, such as image classification, video classification, image segmentation, etc. Therefore, if you want to build a project on deep learning or image processing, you can refer to this source.

The link for downloading the dataset from this source is https://www.visualdata.io/.

9. Scikit-learn dataset

Scikit-learn, a well-known machine learning library in Python, provides several built-in datasets for practice and experimentation. These datasets are accessible through the scikit-learn API and can be used for learning various machine learning algorithms. Scikit-learn offers both toy datasets, which are small and simplified, and real-world datasets with greater complexity. Examples of scikit-learn datasets include the Iris dataset, the Boston Housing dataset, and the Wine dataset. A loading sketch follows.
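As a brief sketch, a built-in toy dataset can be loaded in a couple of lines:

# Loading one of scikit-learn's built-in toy datasets
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)     # (150, 4): 150 samples, 4 features
print(iris.target_names)   # the three iris species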

The link to download datasets from this source
is https://scikit-learn.org/stable/datasets/index.html.

Data Ethics and Privacy:

Data ethics and privacy are fundamental considerations in machine learning projects. It is essential to ensure that data is gathered and used ethically, respecting privacy rights and complying with relevant laws and regulations. Data practitioners should take measures to safeguard data privacy, obtain appropriate consent, and handle sensitive data responsibly. Resources such as ethical guidelines and privacy frameworks can provide guidance on maintaining ethical practices in data collection and use.

Data Preprocessing in Machine learning


Data preprocessing is a process of preparing the raw data and making it suitable for a
machine learning model. It is the first and crucial step while creating a machine learning
model.

When creating a machine learning project, we do not always come across clean and formatted data. And while doing any operation with data, it is mandatory to clean it and put it in a formatted way. For this, we use the data preprocessing task.

Why do we need Data Preprocessing?

Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be directly used by machine learning models. Data preprocessing is the required task of cleaning the data and making it suitable for a machine learning model; it also increases the accuracy and efficiency of the model.

It involves below steps:

 Getting the dataset

 Importing libraries

 Importing datasets

 Finding Missing Data

 Encoding Categorical Data

 Splitting dataset into training and test set

 Feature scaling

1) Get the Dataset

To create a machine learning model, the first thing we require is a dataset, as a machine learning model completely works on data. The collected data for a particular problem in a proper format is known as the dataset.

Datasets may be of different formats for different purposes; for example, the dataset for a business problem will be different from the dataset required for a medical problem such as liver disease. So each dataset is different from other datasets. To use the dataset in our code, we usually put it into a CSV file. However, sometimes we may also need to use an HTML or xlsx file.

What is a CSV File?

CSV stands for "Comma-Separated Values"; it is a file format which allows us to save tabular data, such as spreadsheets. It is useful for huge datasets and allows us to use these datasets in programs.

Here we will use a demo dataset for data preprocessing; for practice, it can be downloaded from "https://www.superdatascience.com/pages/machine-learning". For real-world problems, we can download datasets online from various sources such as https://www.kaggle.com/uciml/datasets, https://archive.ics.uci.edu/ml/index.php, etc.

We can also create our own dataset by gathering data using various APIs with Python and putting that data into a .csv file.

2) Importing Libraries

In order to perform data preprocessing using Python, we need to import some predefined
Python libraries. These libraries are used to perform some specific jobs. There are three
specific libraries that we will use for data preprocessing, which are:

Numpy: The Numpy Python library is used for including any type of mathematical operation in the code. It is the fundamental package for scientific computation in Python. It also supports large, multi-dimensional arrays and matrices. In Python, we can import it as:

import numpy as nm

Here we have used nm, which is a short name for Numpy, and it will be used in the whole
program.

Matplotlib: The second library is matplotlib, which is a Python 2D plotting library; with this library, we need to import its sub-library pyplot. This library is used to plot any type of chart in Python. It will be imported as below:

import matplotlib.pyplot as mpt

Here we have used mpt as a short name for this library.

Pandas: The last library is the Pandas library, which is one of the most famous Python libraries, used for importing and managing datasets. It is an open-source data manipulation and analysis library. It will be imported as below:

import pandas as pd

Here, we have used pd as a short name for this library.

3) Importing the Datasets

Now we need to import the datasets which we have collected for our machine learning
project. But before importing a dataset, we need to set the current directory as a working
directory. To set a working directory in Spyder IDE, we need to follow the below steps:

1. Save your Python file in the directory which contains dataset.


2. Go to File explorer option in Spyder IDE, and select the required directory.
3. Click on F5 button or run option to execute the file.

After saving the Python file alongside the required dataset and selecting the directory, the current folder is set as the working directory.

read_csv() function:

Now to import the dataset, we will use the read_csv() function of the pandas library, which is used to read a CSV file and perform various operations on it. Using this function, we can read a CSV file locally as well as through a URL.

We can use read_csv function as below:

data_set= pd.read_csv('Dataset.csv')

Here, data_set is the name of the variable to store our dataset, and inside the function we have passed the name of our dataset file. Once we execute the above line of code, it will successfully import the dataset into our code. We can also check the imported dataset by clicking on the Variable Explorer section and then double-clicking on data_set.

In the Variable Explorer view, indexing starts from 0, which is the default indexing in Python. We can also change the format of our dataset by clicking on the format option.

Extracting dependent and independent variables:

In machine learning, it is important to distinguish the matrix of features (independent variables) and the dependent variables in the dataset. In our dataset, there are three independent variables, Country, Age, and Salary, and one dependent variable, Purchased.

Extracting independent variable:

To extract an independent variable, we will use iloc[ ] method of Pandas library. It is used to
extract the required rows and columns from the dataset.

x= data_set.iloc[:,:-1].values

In the above code, the first colon(:) is used to take all the rows, and the second colon(:) is for
all the columns. Here we have used :-1, because we don't want to take the last column as it
contains the dependent variable. So by doing this, we will get the matrix of features.

By executing the above code, we will get output as:

[['India' 38.0 68000.0]
['France' 43.0 45000.0]
['Germany' 30.0 54000.0]
['France' 48.0 65000.0]
['Germany' 40.0 nan]
['India' 35.0 58000.0]
['Germany' nan 53000.0]
['France' 49.0 79000.0]
['India' 50.0 88000.0]
['France' 37.0 77000.0]]

As we can see in the above output, there are only the three independent variables.

Extracting dependent variable:

To extract dependent variables, again, we will use Pandas .iloc[] method.

y= data_set.iloc[:,3].values

Here we have taken all the rows with the last column only. It will give the array of dependent
variables.

By executing the above code, we will get output as:

Output:

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
dtype=object)

4) Handling Missing data:


The next step of data preprocessing is to handle missing data in the datasets. If our dataset
contains some missing data, then it may create a huge problem for our machine learning
model. Hence it is necessary to handle missing values present in the dataset.

Ways to handle missing data:

There are mainly two ways to handle missing data, which are:

By deleting the particular row: This way is commonly used to deal with null values. Here, we just delete the specific row or column which consists of null values. But this way is not very efficient, and removing data may lead to loss of information, which will not give an accurate output.

By calculating the mean: In this way, we calculate the mean of the column or row which contains a missing value and put it in the place of the missing value. This strategy is useful for features which have numeric data, such as age, salary, year, etc. Here, we will use this approach.

To handle missing values, we will use the Scikit-learn library in our code, which contains various utilities for building machine learning models. Here we will use the SimpleImputer class of the sklearn.impute module (in older scikit-learn versions, this was the Imputer class of sklearn.preprocessing). Below is the code for it:

#handling missing data (Replacing missing data with the mean value)
from sklearn.impute import SimpleImputer
imputer= SimpleImputer(missing_values=nm.nan, strategy='mean')
#Fitting imputer object to the independent variables x.
imputer= imputer.fit(x[:, 1:3])
#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])

Output:

array([['India', 38.0, 68000.0],
['France', 43.0, 45000.0],
['Germany', 30.0, 54000.0],
['France', 48.0, 65000.0],
['Germany', 40.0, 65222.22222222222],
['India', 35.0, 58000.0],
['Germany', 41.111111111111114, 53000.0],
['France', 49.0, 79000.0],
['India', 50.0, 88000.0],
['France', 37.0, 77000.0]], dtype=object)

As we can see in the above output, the missing values have been replaced with the means of the rest of the column values.

5) Encoding Categorical data:

Categorical data is data which has some categories; in our dataset, there are two categorical variables, Country and Purchased.

Since a machine learning model works entirely on mathematics and numbers, a categorical variable in our dataset may create trouble while building the model. So it is necessary to encode these categorical variables into numbers.

For Country variable:

Firstly, we will convert the country values into numbers. To do this, we will use the LabelEncoder() class from the sklearn.preprocessing library.

#Categorical data
#for Country Variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])

Output:

Out[15]:
array([[2, 38.0, 68000.0],
[0, 43.0, 45000.0],
[1, 30.0, 54000.0],
[0, 48.0, 65000.0],
[1, 40.0, 65222.22222222222],
[2, 35.0, 58000.0],
[1, 41.111111111111114, 53000.0],
[0, 49.0, 79000.0],
[2, 50.0, 88000.0],
[0, 37.0, 77000.0]], dtype=object)

Explanation:

In the above code, we have imported the LabelEncoder class of the sklearn library. This class has successfully encoded the variables into digits.

But in our case, there are three country categories, and as we can see in the above output, these are encoded into 0, 1, and 2. From these values, the machine learning model may assume that there is some ordering or correlation between these categories, which would produce wrong output. To remove this issue, we will use dummy encoding.

Dummy Variables:

Dummy variables are variables which have values 0 or 1. The value 1 indicates the presence of that category in a particular column, and the rest of the variables become 0. With dummy encoding, we have a number of columns equal to the number of categories.

In our dataset, we have 3 categories, so it will produce three columns having 0 and 1 values. For dummy encoding, we will use the OneHotEncoder class of the preprocessing library, applied to the first column via a ColumnTransformer (in older scikit-learn versions, OneHotEncoder's categorical_features parameter served this purpose).

#for Country Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
#Encoding for dummy variables
onehot_encoder= ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x= nm.array(onehot_encoder.fit_transform(x), dtype=float)

Output:

array([[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,


6.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.30000000e+01,
4.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
5.40000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
6.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
6.52222222e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.50000000e+01,
5.80000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.11111111e+01,
5.30000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.90000000e+01,
7.90000000e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 5.00000000e+01,
8.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
7.70000000e+04]])

As we can see in the above output, all the variables are encoded into numbers 0 and 1 and
divided into three columns.

It can be seen more clearly in the Variable Explorer section, by clicking on the x option.

For Purchased Variable:

labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)

For the second categorical variable, we only use the labelencoder object of the LabelEncoder class. Here we are not using the OneHotEncoder class because the Purchased variable has only two categories, yes or no, which are automatically encoded into 0 and 1.

Output:

Out[17]: array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])


6) Splitting the Dataset into the Training set and Test set

In machine learning data preprocessing, we divide our dataset into a training set and test set.
This is one of the crucial steps of data preprocessing as by doing this, we can enhance the
performance of our machine learning model.

Suppose we give training to our machine learning model with one dataset and then test it on a completely different dataset. This will create difficulties for our model in understanding the correlations between the variables.

If we train our model very well and its training accuracy is very high, but it performs poorly when we provide a new dataset to it, then the model does not generalize. So we always try to make a machine learning model which performs well with the training set and also with the test dataset. Here, we can define these datasets as:

Training Set: A subset of dataset to train the machine learning model, and we already know
the output.

Test set: A subset of dataset to test the machine learning model, and by using the test set,
model predicts the output.

For splitting the dataset, we will use the below lines of code:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

Explanation:

o In the above code, the first line imports the function that splits arrays of the dataset into random train and test subsets.
o In the second line, we have used four variables for our output:
o x_train: features for the training data
o x_test: features for the testing data
o y_train: dependent variable for the training data
o y_test: dependent variable for the testing data
o In the train_test_split() function, we have passed four parameters, of which the first two are the arrays of data, and test_size specifies the size of the test set. The test_size may be .5, .3, or .2, which tells the dividing ratio of training and testing sets.
o The last parameter, random_state, is used to set a seed for the random generator so that you always get the same result; the most used value for this is 42.


Output:

By executing the above code, we will get 4 different variables, which can be seen under the Variable Explorer section. There, the x and y variables are divided into 4 different variables with corresponding values.

7) Feature Scaling

Feature scaling is the final step of data preprocessing in machine learning. It is a technique to
standardize the independent variables of the dataset to a specific range. In feature scaling, we
put our variables on the same scale so that no single variable dominates the others.

Consider the dataset used above: the age and salary column values are not on the same scale.
Many machine learning models are based on Euclidean distance, and if we do not scale the
variables, it will cause issues for such models.

The Euclidean distance between two points (x1, y1) and (x2, y2) is given as: d = √((x2 − x1)² + (y2 − y1)²)

If we compute the distance between any two records using age and salary, the salary values will
dominate the age values and produce an incorrect result. To remove this issue, we need to perform
feature scaling for machine learning.
There are two ways to perform feature scaling in machine learning:

Standardization

Normalization

Here, we will use the standardization method for our dataset.
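As a rough sketch of the two formulas (the feature values below are hypothetical, x is a NumPy array):

import numpy as np

x = np.array([48000., 54000., 58000., 61000., 65000.])

# Standardization: rescale to zero mean and unit variance
x_standardized = (x - x.mean()) / x.std()

# Normalization (min-max scaling): rescale into the [0, 1] range
x_normalized = (x - x.min()) / (x.max() - x.min())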

For feature scaling, we will import StandardScaler class of sklearn.preprocessing library as:

from sklearn.preprocessing import StandardScaler

Now, we will create the object of StandardScaler class for independent variables or features.
And then we will fit and transform the training dataset.

st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)

For the test dataset, we will directly apply the transform() function instead
of fit_transform(), because the scaler has already been fitted on the training set.

x_test= st_x.transform(x_test)

Output:

By executing the above lines of code, we will get the scaled values for x_train and x_test as:

x_train:

x_test:

As we can see in the above output, all the variables have been standardized to a comparable scale (most values fall between -1 and 1).

Note: Here, we have not scaled the dependent variable because it takes only the two values 0
and 1. But if the dependent variable had a wider range of values, we would need to scale it
as well.

Combining all the steps:

Now, in the end, we can combine all the steps together to make our complete code more
understandable.

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('Dataset.csv')

#Extracting Independent Variable
x= data_set.iloc[:, :-1].values

#Extracting Dependent variable
y= data_set.iloc[:, 3].values

#handling missing data (Replacing missing data with the mean value)
from sklearn.preprocessing import Imputer
imputer= Imputer(missing_values ='NaN', strategy='mean', axis = 0)

#Fitting imputer object to the independent variables x.
imputer= imputer.fit(x[:, 1:3])

#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])

#for Country Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])

#Encoding for dummy variables
onehot_encoder= OneHotEncoder(categorical_features= [0])
x= onehot_encoder.fit_transform(x).toarray()

#encoding for purchased variable
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

#Feature Scaling of datasets
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

In the above code, we have included all the data preprocessing steps together. But some of
these steps or lines of code are not necessary for all machine learning models, so we can
exclude them to make the code reusable across models.
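Note: The Imputer class and the categorical_features argument of OneHotEncoder used above belong to older scikit-learn versions and have since been removed. A rough equivalent for recent scikit-learn (a sketch, assuming the same raw x extracted from the dataset above) could look like:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Replace missing numeric values (age and salary columns) with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
x[:, 1:3] = imputer.fit_transform(x[:, 1:3])

# One-hot encode the country column (column 0) and pass the rest through unchanged
ct = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough')
x = ct.fit_transform(x)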

Supervised Machine Learning


Supervised learning is the type of machine learning in which machines are trained using well
"labelled" training data, and on the basis of that data, machines predict the output. Labelled
data means that the input data is already tagged with the correct output.

In supervised learning, the training data provided to the machines works as the supervisor that
teaches the machines to predict the output correctly. It applies the same concept as a student
learning under the supervision of a teacher.

Supervised learning is a process of providing input data as well as correct output data to the
machine learning model. The aim of a supervised learning algorithm is to find a mapping
function to map the input variable(x) with the output variable(y).

In the real-world, supervised learning can be used for Risk Assessment, Image
classification, Fraud Detection, spam filtering, etc.

How Supervised Learning Works?

In supervised learning, models are trained using a labelled dataset, where the model learns
about each type of data. Once the training process is completed, the model is tested on the
basis of test data (a held-out subset of the dataset), and then it predicts the output.

The working of Supervised learning can be easily understood by the below example and
diagram:

Suppose we have a dataset of different types of shapes which includes square, rectangle,
triangle, and Polygon. Now the first step is that we need to train the model for each shape.

o If the given shape has four sides, and all the sides are equal, then it will be labelled as
a Square.
o If the given shape has three sides, then it will be labelled as a Triangle.
o If the given shape has six equal sides, then it will be labelled as a Hexagon.

Now, after training, we test our model using the test set, and the task of the model is to
identify the shape.

The machine is already trained on all types of shapes, and when it finds a new shape, it
classifies the shape on the basis of its number of sides and predicts the output.

Steps Involved in Supervised Learning:

o First, determine the type of training dataset.
o Collect/gather the labelled training data.
o Split the dataset into a training dataset, a test dataset, and a validation dataset.
o Determine the input features of the training dataset, which should carry enough
information for the model to accurately predict the output.
o Determine a suitable algorithm for the model, such as a support vector machine,
decision tree, etc.
o Execute the algorithm on the training dataset. Sometimes we need validation sets as
control parameters; these are subsets of the training dataset.
o Evaluate the accuracy of the model by providing the test set. If the model predicts the
correct output, our model is accurate.

Types of supervised Machine learning Algorithms:

Supervised learning can be further divided into two types of problems:

1. Regression

Regression algorithms are used if there is a relationship between the input variable and the
output variable. It is used for the prediction of continuous variables, such as Weather
forecasting, Market Trends, etc. Below are some popular Regression algorithms which come
under supervised learning:

o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression

2. Classification

Classification algorithms are used when the output variable is categorical, which means there
are discrete classes such as Yes-No, Male-Female, True-False, etc. A common example is spam
filtering. Below are some popular classification algorithms which come under supervised learning:

o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines

Note: We will discuss these algorithms in detail in later chapters.



Advantages of Supervised learning:

o With the help of supervised learning, the model can predict the output on the basis of
prior experiences.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning model helps us to solve various real-world problems such
as fraud detection, spam filtering, etc.

Disadvantages of supervised learning:

o Supervised learning models are not suitable for handling complex tasks.
o Supervised learning cannot predict the correct output if the test data is very different from
the training dataset.
o Training requires a lot of computation time.
o In supervised learning, we need enough knowledge about the classes of objects.

Unsupervised Machine Learning


In the previous topic, we learned about supervised machine learning, in which models are
trained using labeled data. But there may be many cases in which we do not have labeled data
and need to find the hidden patterns from the given dataset. To solve such cases in machine
learning, we need unsupervised learning techniques.
What is Unsupervised Learning?
As the name suggests, unsupervised learning is a machine learning technique in which
models are not supervised using a training dataset. Instead, the model itself finds the hidden
patterns and insights from the given data. It can be compared to the learning which takes place
in the human brain while learning new things. It can be defined as:
“Unsupervised learning is a type of machine learning in which models are trained using
unlabelled dataset and are allowed to act on that data without any supervision.”
Unsupervised learning cannot be directly applied to a regression or classification problem
because unlike supervised learning, we have the input data but no corresponding output data.
The goal of unsupervised learning is to find the underlying structure of dataset, group that
data according to similarities, and represent that dataset in a compressed format.
Example: Suppose the unsupervised learning algorithm is given an input dataset containing
images of different types of cats and dogs. The algorithm is never trained upon the given
dataset, which means it does not have any idea about the features of the dataset. The task of
the unsupervised learning algorithm is to identify the image features on their own.

Unsupervised learning algorithm will perform this task by clustering the image dataset into
the groups according to similarities between images.

Why use Unsupervised Learning?

Below are some main reasons which describe the importance of Unsupervised Learning:

o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is much like how a human learns to think through their own
experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes it
all the more important.
o In the real world, we do not always have input data with corresponding output, so to
solve such cases, we need unsupervised learning.

Working of Unsupervised Learning

Working of unsupervised learning can be understood by the below diagram:

Here, we have taken unlabelled input data, which means it is not categorized and the
corresponding outputs are also not given. This unlabelled input data is fed to the
machine learning model in order to train it. First, the model will interpret the raw data to
find the hidden patterns, and then it will apply suitable algorithms such as k-means
clustering, decision tree, etc.

Once it applies the suitable algorithm, the algorithm divides the data objects into groups
according to the similarities and differences between the objects.
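For instance, a minimal clustering sketch using scikit-learn's k-means (the 2-D points here are hypothetical):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical two-dimensional points forming two loose groups
X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 10], [8, 9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # cluster index (0 or 1) assigned to each point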

Types of Unsupervised Learning Algorithm:

The unsupervised learning algorithm can be further categorized into two types of problems:

o Clustering: Clustering is a method of grouping objects into clusters such that
objects with the most similarities remain in one group and have few or no similarities with
the objects of another group. Cluster analysis finds the commonalities between
the data objects and categorizes them as per the presence and absence of those
commonalities.
o Association: An association rule is an unsupervised learning method which is used
for finding relationships between variables in a large database. It determines the
sets of items that occur together in the dataset. Association rules make marketing
strategies more effective; for example, people who buy bread (item X) also tend to buy
butter or jam (item Y). A typical application of association rules is Market
Basket Analysis.

Unsupervised Learning algorithms:

Below is the list of some popular unsupervised learning algorithms:

o K-means clustering
o KNN (k-nearest neighbors)
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis

o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition

Advantages of Unsupervised Learning

o Unsupervised learning is used for more complex tasks as compared to supervised
learning because, in unsupervised learning, we don't have labeled input data.
o Unsupervised learning is preferable because it is easier to get unlabeled data than
labeled data.

Disadvantages of Unsupervised Learning

o Unsupervised learning is intrinsically more difficult than supervised learning because it
does not have corresponding output labels.
o The result of an unsupervised learning algorithm might be less accurate, since the input
data is not labeled and the algorithm does not know the exact output in advance.

SUPERVISED LEARNING
Regression Analysis in Machine learning:
Regression analysis is a statistical method to model the relationship between a dependent
(target) variable and one or more independent (predictor) variables. More specifically,
regression analysis helps us understand how the value of the dependent variable changes
corresponding to one independent variable when the other independent variables are held
fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.

We can understand the concept of regression analysis using the below example:

Example: Suppose there is a marketing company A, which runs various advertisements every
year and gets sales from them. The below list shows the advertisements made by the company in
the last 5 years and the corresponding sales:

Now, the company wants to spend $200 on advertisement in the year 2019 and wants to know the
prediction about the sales for this year. To solve such prediction problems in machine
learning, we need regression analysis.
Regression is a supervised learning technique which helps in finding the correlation between variables
and enables us to predict a continuous output variable based on one or more predictor variables.
It is mainly used for prediction, forecasting, time series modeling, and determining the causal-effect
relationship between variables.
In regression, we plot a graph between the variables which best fits the given datapoints; using this
plot, the machine learning model can make predictions about the data. In simple words, "Regression
shows a line or curve that passes through all the datapoints on the target-predictor graph in such a way
that the vertical distance between the datapoints and the regression line is minimum." The distance
between the datapoints and the line tells whether the model has captured a strong relationship or not.

Some examples of regression can be as:


 Prediction of rain using temperature and other factors
 Determining Market trends
 Prediction of road accidents due to rash driving.

Terminologies Related to the Regression Analysis:

o Dependent Variable: The main factor in regression analysis which we want to
predict or understand is called the dependent variable. It is also called the target
variable.

o Independent Variable: The factors which affect the dependent variables or which
are used to predict the values of the dependent variables are called independent
variable, also called as a predictor.
o Outliers: Outlier is an observation which contains either very low value or very high
value in comparison to other observed values. An outlier may hamper the result, so it
should be avoided.
o Multicollinearity: If the independent variables are highly correlated with each other,
this condition is called multicollinearity. It should not be present in the dataset,
because it creates problems while ranking the most affecting variables.
o Underfitting and Overfitting: If our algorithm works well with the training dataset
but not well with test dataset, then such problem is called Overfitting. And if our
algorithm does not perform well even with training dataset, then such problem is
called underfitting.

Why do we use Regression Analysis?

As mentioned above, regression analysis helps in the prediction of a continuous variable.
There are various scenarios in the real world where we need future predictions, such as
weather conditions, sales, and marketing trends; for such cases we need a technique which can
make predictions accurately. Regression analysis is such a statistical method, used in machine
learning and data science. Below are some other reasons for using regression analysis:

o Regression estimates the relationship between the target and the independent variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing the regression, we can confidently determine the most important
factor, the least important factor, and how each factor is affecting the other
factors.

Types of Regression

There are various types of regressions which are used in data science and machine learning.
Each type has its own importance on different scenarios, but at the core, all the regression
methods analyse the effect of the independent variable on dependent variables. Here we are
discussing some important types of regression which are given below:

o Linear Regression
o Logistic Regression

o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression

Linear Regression:
o Linear regression is a statistical regression method which is used for predictive
analysis.
o It is one of the very simple and easy algorithms which works on regression and shows
the relationship between the continuous variables.
o It is used for solving the regression problem in machine learning.
o Linear regression shows the linear relationship between the independent variable (X-
axis) and the dependent variable (Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear regression is called simple
linear regression. And if there is more than one input variable, then such linear
regression is called multiple linear regression.
o The relationship between variables in the linear regression model can be explained
using the below image. Here we are predicting the salary of an employee on the basis
of the year of experience.

Below is the mathematical equation for Linear regression:

Y= aX+b

Here, Y=dependent variables (target variables),


X= Independent variables (predictor variables),
a and b are the linear coefficients

Some popular applications of linear regression are:


 Analyzing trends and sales estimates
 Salary forecasting
 Real estate prediction
 Arriving at ETAs in traffic.

Logistic Regression:
o Logistic regression is another supervised learning algorithm which is used to solve the
classification problems. In classification problems, we have dependent variables in a
binary or discrete format such as 0 or 1.
o Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes
or No, True or False, Spam or not spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it is different from the linear regression
algorithm in terms of how it is used.

o Logistic regression uses the sigmoid function or logistic function to model the data.
The function can be represented as:

f(x) = 1 / (1 + e^(-x))

o f(x) = output between the 0 and 1 value
o x = input to the function
o e = base of the natural logarithm

When we provide the input values (data) to the function, it gives the S-curve as follows:

It uses the concept of threshold levels: values above the threshold level are rounded up to 1,
and values below the threshold level are rounded down to 0.
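A small sketch of the sigmoid and this thresholding step (the input values are illustrative only):

import numpy as np

def sigmoid(z):
    # logistic function: squashes any real number into the (0, 1) interval
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.5, 3.0])
probabilities = sigmoid(z)                        # approx. [0.12, 0.62, 0.95]
predictions = (probabilities >= 0.5).astype(int)  # threshold at 0.5 -> [0, 1, 1]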
There are three types of logistic regression:
 Binary (0/1, pass/fail)
 Multinomial (cats, dogs, lions)
 Ordinal (low, medium, high)

Polynomial Regression:
o Polynomial Regression is a type of regression which models the non-linear
dataset using a linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve between the
value of x and corresponding conditional values of y.
o Suppose there is a dataset which consists of datapoints that lie in a non-linear
fashion; in such a case, linear regression will not best fit those datapoints. To cover
such datapoints, we need polynomial regression.

o In Polynomial regression, the original features are transformed into polynomial
features of given degree and then modeled using a linear model. Which means the
datapoints are best fitted using a polynomial line.

o The equation for polynomial regression is also derived from the linear regression
equation: the linear equation Y = b0 + b1x is transformed into the polynomial
regression equation Y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ.
o Here Y is the predicted/target output, and b0, b1, ..., bn are the regression coefficients;
x is our independent/input variable.
o The model is still considered linear because the coefficients are linear; only the
features are raised to higher powers (quadratic, cubic, and so on).
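A brief sketch of this transform-then-fit idea using scikit-learn (x and y are assumed to be the usual feature matrix and target vector):

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly = PolynomialFeatures(degree=2)   # adds squared (and interaction) feature columns
x_poly = poly.fit_transform(x)        # original features -> polynomial features

model = LinearRegression()
model.fit(x_poly, y)                  # still linear in the coefficients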

Support Vector Regression:

Support Vector Machine is a supervised learning algorithm which can be used for regression
as well as classification problems. So if we use it for regression problems, then it is termed as
Support Vector Regression.

Support Vector Regression is a regression algorithm which works for continuous variables.
Below are some keywords which are used in Support Vector Regression:

o Kernel: It is a function used to map lower-dimensional data into higher-dimensional
data.
o Hyperplane: In general SVM, it is a separation line between two classes, but in SVR,
it is a line which helps to predict the continuous variables and cover most of the
datapoints.
o Boundary line: Boundary lines are the two lines apart from hyperplane, which
creates a margin for datapoints.

o Support vectors: Support vectors are the datapoints which are nearest to the
hyperplane and opposite class.

In SVR, we always try to determine a hyperplane with a maximum margin, so that maximum
number of datapoints are covered in that margin. The main goal of SVR is to consider the
maximum datapoints within the boundary lines and the hyperplane (best-fit line) must
contain a maximum number of datapoints. Consider the below image:

Here, the blue line is called hyperplane, and the other two lines are known as boundary lines.

Decision Tree Regression:


o Decision Tree is a supervised learning algorithm which can be used for solving both
classification and regression problems.
o It can solve problems for both categorical and numerical data
o Decision Tree regression builds a tree-like structure in which each internal node
represents a "test" on an attribute, each branch represents the result of the test, and
each leaf node represents the final decision or result.
o A decision tree is constructed starting from the root node/parent node (the dataset),
which splits into left and right child nodes (subsets of the dataset). These child nodes
are further divided into their own child nodes, themselves becoming the parent nodes
of those nodes. Consider the below image:

The above image shows an example of Decision Tree regression; here, the model is trying to
predict the choice of a person between a sports car and a luxury car.

Random Forest Regression:
o Random forest is one of the most powerful supervised learning algorithms which is
capable of performing regression as well as classification tasks.
o Random Forest regression is an ensemble learning method which combines
multiple decision trees and predicts the final output based on the average of each tree's
output. The combined decision trees are called base models, and this can be
represented more formally as:

g(x)= f0(x)+ f1(x)+ f2(x)+....

o Random forest uses the Bagging or Bootstrap Aggregation technique of ensemble
learning, in which the aggregated decision trees run in parallel and do not interact with
each other.
o With the help of Random Forest regression, we can prevent Overfitting in the model
by creating random subsets of the dataset.

Ridge Regression:
o Ridge regression is one of the most robust versions of linear regression, in which a small
amount of bias is introduced so that we can get better long-term predictions.
o The amount of bias added to the model is known as the Ridge Regression penalty. We can
compute this penalty term by multiplying lambda by the squared weight of each
individual feature.
o The equation for ridge regression will be:

Cost = Σ(yi − ŷi)² + λ Σ bj²

o A general linear or polynomial regression will fail if there is high collinearity between the
independent variables; to solve such problems, ridge regression can be used.
o Ridge regression is a regularization technique used to reduce the complexity of the
model. It is also called L2 regularization.
o It helps to solve problems where we have more parameters than samples.

Lasso Regression:
o Lasso regression is another regularization technique used to reduce the complexity of the model.
o It is similar to Ridge Regression, except that the penalty term contains the absolute
values of the weights instead of their squares.
o Since it takes absolute values, it can shrink a slope all the way to 0, whereas Ridge Regression
can only shrink it close to 0.
o It is also called L1 regularization. The equation for Lasso regression will be:

Cost = Σ(yi − ŷi)² + λ Σ |bj|
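In scikit-learn, both regularized models follow the same pattern as ordinary linear regression; a small sketch (x_train and y_train as in the preprocessing chapter, with alpha playing the role of lambda):

from sklearn.linear_model import Ridge, Lasso

ridge = Ridge(alpha=1.0)   # L2 penalty: lambda times the sum of squared weights
ridge.fit(x_train, y_train)

lasso = Lasso(alpha=0.1)   # L1 penalty: lambda times the sum of absolute weights
lasso.fit(x_train, y_train)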

Linear Regression in Machine Learning:
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.

The linear regression algorithm shows a linear relationship between a dependent variable (y) and
one or more independent variables (x), hence it is called linear regression. Since linear regression
shows a linear relationship, it finds how the value of the dependent variable changes according
to the value of the independent variable.

The linear regression model provides a sloped straight line representing the relationship
between the variables. Consider the below image:

Mathematically, we can represent a linear regression as:


y= a0+a1x+ ε

Here,

Y = Dependent Variable (Target Variable)


X = Independent Variable (predictor Variable)
a0 = intercept of the line (Gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor to each input value).
ε = random error

The values for x and y variables are training datasets for Linear Regression model
representation.

Types of Linear Regression

Linear regression can be further divided into two types of the algorithm:

Simple Linear Regression: If a single independent variable is used to predict the value of a
numerical dependent variable, then such a Linear Regression algorithm is called Simple
Linear Regression.

Multiple Linear regression: If more than one independent variable is used to predict the
value of a numerical dependent variable, then such a Linear Regression algorithm is called
Multiple Linear Regression.

Linear Regression Line

A linear line showing the relationship between the dependent and independent variables is
called a regression line. A regression line can show two types of relationship:

Positive Linear Relationship: If the dependent variable increases on the Y-axis and
independent variable increases on X-axis, then such a relationship is termed as a Positive
linear relationship.

Negative Linear Relationship: If the dependent variable decreases on the Y-axis and
independent variable increases on the X-axis, then such a relationship is called a negative
linear relationship.

Finding the best fit line:

When working with linear regression, our main goal is to find the best fit line that means the
error between predicted values and actual values should be minimized. The best fit line will
have the least error.

Different values for the weights or coefficients of the line (a0, a1) give different regression
lines, so we need to calculate the best values for a0 and a1 to find the best fit line; to
calculate these, we use the cost function.

Cost function-
o Different values for the weights or coefficients of the line (a0, a1) give different lines
of regression, and the cost function is used to estimate the values of the coefficients for
the best fit line.
o The cost function optimizes the regression coefficients or weights. It measures how well a
linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which
maps the input variable to the output variable. This mapping function is also known
as Hypothesis function.

For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the
average of the squared errors between the predicted values and the actual values. For the
above linear equation, MSE can be calculated as:

MSE = (1/N) Σ (yi − (a1xi + a0))²

Where,

N=Total number of observations


Yi = Actual value
(a1xi+a0)= Predicted value.

Residuals: The distance between an actual value and the predicted value is called the residual. If
the observed points are far from the regression line, the residuals will be high, and so the cost
function will be high. If the scatter points are close to the regression line, the residuals will
be small, and hence the cost function will be low.

Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost
function.

o A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.
o It is done by randomly selecting initial coefficient values and then iteratively updating
the values to reach the minimum of the cost function.
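A minimal sketch of gradient descent for the simple linear model y = a1x + a0 (the learning rate and epoch count here are arbitrary choices):

import numpy as np

def gradient_descent(x, y, lr=0.01, epochs=1000):
    a0, a1 = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        y_pred = a1 * x + a0
        # partial derivatives of the MSE cost with respect to a0 and a1
        d_a0 = (-2.0 / n) * np.sum(y - y_pred)
        d_a1 = (-2.0 / n) * np.sum((y - y_pred) * x)
        a0 -= lr * d_a0   # step both coefficients downhill
        a1 -= lr * d_a1
    return a0, a1

# hypothetical usage, with x and y as 1-D arrays of equal length:
# a0, a1 = gradient_descent(x, y)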

Model Performance:

The goodness of fit determines how well the line of regression fits the set of observations. The
process of finding the best model out of various models is called optimization. It can be
achieved by the below method:

1. R-squared method:

o R-squared is a statistical method that determines the goodness of fit.


o It measures the strength of the relationship between the dependent and independent
variables on a scale of 0-100%.
o A high value of R-square indicates less difference between the predicted
values and the actual values, and hence represents a good model.
o It is also called the coefficient of determination, or the coefficient of multiple
determination for multiple regression.
o It can be calculated from the below formula:

R² = Explained variation / Total variation
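Equivalently, scikit-learn can compute it directly; a one-line sketch (y_test and y_pred as in the implementation sections that follow):

from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)   # equals 1 - (residual sum of squares / total sum of squares)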

Assumptions of Linear Regression

Below are some important assumptions of Linear Regression. These are some formal checks
while building a Linear Regression model, which ensures to get the best possible result from
the given dataset.

o Linear relationship between the features and target: Linear regression assumes the
linear relationship between the dependent and independent variables.
o Small or no multicollinearity between the features: Multicollinearity means a high
correlation between the independent variables. Due to multicollinearity, it may be
difficult to find the true relationship between the predictors and target variables. Or
we can say, it is difficult to determine which predictor variable is affecting the target
variable and which is not. So, the model assumes either little or no multicollinearity
between the features or independent variables.

o Homoscedasticity Assumption: Homoscedasticity is a situation when the error term
is the same for all the values of independent variables. With homoscedasticity, there
should be no clear pattern distribution of data in the scatter plot.
o Normal distribution of error terms: Linear regression assumes that the error terms
should follow the normal distribution pattern. If the error terms are not normally
distributed, then confidence intervals will become either too wide or too narrow,
which may cause difficulties in finding coefficients. This can be checked using a Q-Q
plot; if the plot shows a straight line without any deviation, the error terms are
normally distributed.
o No autocorrelations: The linear regression model assumes no autocorrelation in error
terms. If there is any correlation in the error term, then it will drastically reduce the
accuracy of the model. Autocorrelation usually occurs if there is a dependency
between residual errors.

Simple Linear Regression in Machine Learning:


Simple Linear Regression is a type of regression algorithm that models the relationship
between a dependent variable and a single independent variable. The relationship shown by a
Simple Linear Regression model is linear (a sloped straight line), hence it is called Simple
Linear Regression.
The key point in Simple Linear Regression is that the dependent variable must be a
continuous/real value. However, the independent variable can be measured on continuous or
categorical values.

Simple Linear regression algorithm has mainly two objectives:

o Model the relationship between the two variables. Such as the relationship between
Income and expenditure, experience, and Salary, etc.
o Forecasting new observations. Such as Weather forecasting according to
temperature, Revenue of a company according to the investments in a year, etc.

Simple Linear Regression Model:


The Simple Linear Regression model can be represented using the below equation:
y= a0+a1x+ ε

Where,

a0= It is the intercept of the Regression line (can be obtained putting x=0)
a1= It is the slope of the regression line, which tells whether the line is increasing or
decreasing.
ε = The error term. (For a good model it will be negligible)

Implementation of Simple Linear Regression Algorithm using Python:

Problem Statement example for Simple Linear Regression:

Here we are taking a dataset that has two variables: salary (dependent variable) and
experience (independent variable). The goals of this problem are:

o We want to find out if there is any correlation between these two variables
o We will find the best fit line for the dataset.
o How the dependent variable is changing by changing the independent variable.

In this section, we will create a Simple Linear Regression model to find out the best fitting
line for representing the relationship between these two variables.

To implement the Simple Linear regression model in machine learning using Python, we
need to follow the below steps:

Step-1: Data Pre-processing

The first step for creating the Simple Linear Regression model is data pre-processing. We
have already done it earlier in this tutorial. But there will be some changes, which are given
in the below steps:

o First, we will import the three important libraries, which will help us for loading the
dataset, plotting the graphs, and creating the Simple Linear Regression model.

import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
 Next, we will load the dataset into our code:
data_set= pd.read_csv('Salary_Data.csv')

By executing the above line of code (ctrl+ENTER), we can read the dataset on our
Spyder IDE screen by clicking on the variable explorer option.

The above output shows the dataset, which has two variables: Salary and Experience.

 After that, we need to extract the dependent and independent variables from the given
dataset. The independent variable is years of experience, and the dependent variable is
salary. Below is code for it:

x= data_set.iloc[:, :-1].values
y= data_set.iloc[:, 1].values
In the above lines of code, for the x variable we have used -1, since we want to
remove the last column from the dataset. For the y variable we have used 1 as the
parameter, since we want to extract the second column and indexing starts from zero.
By executing the above lines of code, we will get the output for the X and Y variables as:
By executing the above line of code, we will get the output for X and Y variable as:

In the above output image, we can see the X (independent) variable and Y (dependent)
variable has been extracted from the given dataset.

 Next, we will split both variables into the test set and training set. We have 30
observations, so we will take 20 observations for the training set and 10 observations
for the test set. We are splitting our dataset so that we can train our model using a
training dataset and then test the model using a test dataset. The code for this is given
below:

# Splitting the dataset into training and test set.


from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 1/3, random_state
=0)

By executing the above code, we will get x-test, x-train and y-test, y-train dataset.
Consider the below images:
Test-dataset:

Training Dataset:

 For simple linear Regression, we will not use Feature Scaling. Because Python
libraries take care of it for some cases, so we don't need to perform it here. Now, our
dataset is well prepared to work on it and we are going to start building a Simple
Linear Regression model for the given problem.
Step-2: Fitting the Simple Linear Regression to the Training Set:
Now the second step is to fit our model to the training dataset. To do so, we will import the
LinearRegression class of the linear_model library from scikit-learn. After importing the
class, we will create an object of the class named regressor. The code for this is
given below:
#Fitting the Simple Linear Regression model to the training dataset
from sklearn.linear_model import LinearRegression
regressor= LinearRegression()
regressor.fit(x_train, y_train)
In the above code, we have used a fit() method to fit our Simple Linear Regression object to
the training set. In the fit() function, we have passed the x_train and y_train, which is our
training dataset for the dependent and an independent variable. We have fitted our regressor
object to the training set so that the model can easily learn the correlations between the
predictor and target variables. After executing the above lines of code, we will get the below
output.

Output:

Out[7]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
Step: 3. Prediction of test set result:
Our model is now trained with a training set containing the dependent (salary) and an
independent variable (Experience). So, now, our model is ready to predict the output for new
observations. In this step, we will provide the test dataset (new observations) to the model to
check whether it can predict the correct output or not.
We will create prediction vectors y_pred and x_pred, which will contain the predictions for the
test dataset and the training set, respectively.

#Prediction of Test and Training set result


y_pred= regressor.predict(x_test)
x_pred= regressor.predict(x_train)
On executing the above lines of code, two variables named y_pred and x_pred will be generated
in the Variable Explorer; they contain the salary predictions for the test set and the training
set, respectively.
Output:

You can check the variable by clicking on the variable explorer option in the IDE, and also
compare the result by comparing values from y_pred and y_test. By comparing these values,
we can check how good our model is performing.
Step: 4. visualizing the Training set results:
Now in this step, we will visualize the training set result. To do so, we will use the scatter()
function of the pyplot library, which we have already imported in the pre-processing step.
The scatter () function will create a scatter plot of observations.
In the x-axis, we will plot the Years of Experience of employees and on the y-axis, salary of
employees. In the function, we will pass the real values of training set, which means a year of
experience x_train, training set of Salaries y_train, and color of the observations. Here we are
taking a green color for the observation, but it can be any color as per the choice.
Now, we need to plot the regression line, so for this, we will use the plot() function of the
pyplot library. In this function, we will pass the years of experience for training set, predicted
salary for training set x_pred, and color of the line.
Next, we will give the title for the plot. So here, we will use the title() function of the pyplot
library and pass the name "Salary vs Experience (Training Dataset)".
After that, we will assign labels for x-axis and y-axis using xlabel() and ylabel() function.
Finally, we will represent all above things in a graph using show(). The code is given below:

mtp.scatter(x_train, y_train, color="green")


mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Training Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary(In Rupees)")
mtp.show()

Output:

By executing the above lines of code, we will get the below graph plot as an output.

In the above plot, the observations are shown in green, and the prediction is given by
the red regression line. As we can see, most of the observations are close to the regression
line, hence we can say our Simple Linear Regression model is a good model and able to make good
predictions.
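Although this chapter stops at the training-set plot, an analogous sketch for the test set would only swap the scattered points (the red line stays the one learned from the training data):

mtp.scatter(x_test, y_test, color="blue")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Test Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary(In Rupees)")
mtp.show()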

Multiple Linear Regression


In the previous topic, we have learned about Simple Linear Regression, where a single
Independent/Predictor(X) variable is used to model the response variable (Y). But there may
be various cases in which the response variable is affected by more than one predictor
variable; for such cases, the Multiple Linear Regression algorithm is used.
Moreover, Multiple Linear Regression is an extension of Simple Linear regression as it takes
more than one predictor variable to predict the response variable. We can define it as:
“Multiple Linear Regression is one of the important regression algorithms which models the linear
relationship between a single dependent continuous variable and more than one independent
variable.”

Example:
Prediction of CO2 emission based on engine size and number of cylinders in a car.
Some key points about MLR:
 For MLR, the dependent or target variable (Y) must be continuous/real, but the
predictor or independent variables may be continuous or categorical.
 Each feature variable must model the linear relationship with the dependent variable.
 MLR tries to fit a regression line through a multidimensional space of data-points

MLR equation:
In Multiple Linear Regression, the target variable(Y) is a linear combination of multiple
predictor variables x1, x2, x3, ...,xn. Since it is an enhancement of Simple Linear Regression,
so the same is applied for the multiple linear regression equation, the equation becomes:

Y = b0 + b1x1 + b2x2 + b3x3 + ...... + bnxn ............... (a)

Where,

Y= Output/Response variable

b0, b1, b2, b3, ..., bn = Coefficients of the model.

x1, x2, x3, x4,...= Various Independent/feature variable

Assumptions for Multiple Linear Regression:
o A linear relationship should exist between the target and predictor variables.
o The regression residuals must be normally distributed.
o MLR assumes little or no multicollinearity (correlation between the independent
variables) in the data.

Implementation of Multiple Linear Regression model using Python:

To implement MLR using Python, we have below problem:

Problem Description:

We have a dataset of 50 start-up companies. This dataset contains five main pieces of
information: R&D Spend, Administration Spend, Marketing Spend, State, and Profit for
a financial year. Our goal is to create a model that can easily determine which company has
the maximum profit, and which factor most affects the profit of a company.

Since we need to find the Profit, it is the dependent variable, and the other four variables
are independent variables. Below are the main steps of deploying the MLR model:

1. Data Pre-processing Steps


2. Fitting the MLR model to the training set
3. Predicting the result of the test set

Step-1: Data Pre-processing Step:

The very first step is data pre-processing, which we have already discussed in this tutorial.
This process contains the below steps:

 Importing libraries: Firstly we will import the library which will help in building the
model. Below is the code for it:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
o Importing dataset: Now we will import the dataset(50_CompList), which contains
all the variables. Below is the code for it:

#importing datasets
data_set= pd.read_csv('50_CompList.csv')

Output: We will get the dataset as:

In the above output, we can clearly see that there are five variables, of which four are
continuous and one is a categorical variable.

o Extracting dependent and independent Variables:

#Extracting Independent and dependent Variable


x= data_set.iloc[:, :-1].values
y= data_set.iloc[:, 4].values

Output:

Out[5]:

array([[165349.2, 136897.8, 471784.1, 'New York'],


[162597.7, 151377.59, 443898.53, 'California'],
[153441.51, 101145.55, 407934.54, 'Florida'],
[144372.41, 118671.85, 383199.62, 'New York'],
[142107.34, 91391.77, 366168.42, 'Florida'],
[131876.9, 99814.71, 362861.36, 'New York'],
[134615.46, 147198.87, 127716.82, 'California'],
[130298.13, 145530.06, 323876.68, 'Florida'],
[120542.52, 148718.95, 311613.29, 'New York'],
[123334.88, 108679.17, 304981.62, 'California'],
[101913.08, 110594.11, 229160.95, 'Florida'],
[100671.96, 91790.61, 249744.55, 'California'],
[93863.75, 127320.38, 249839.44, 'Florida'],
[91992.39, 135495.07, 252664.93, 'California'],
[119943.24, 156547.42, 256512.92, 'Florida'],
[114523.61, 122616.84, 261776.23, 'New York'],

[78013.11, 121597.55, 264346.06, 'California'],
[94657.16, 145077.58, 282574.31, 'New York'],
[91749.16, 114175.79, 294919.57, 'Florida'],
[86419.7, 153514.11, 0.0, 'New York'],
[76253.86, 113867.3, 298664.47, 'California'],
[78389.47, 153773.43, 299737.29, 'New York'],
[73994.56, 122782.75, 303319.26, 'Florida'],
[67532.53, 105751.03, 304768.73, 'Florida'],
[77044.01, 99281.34, 140574.81, 'New York'],
[64664.71, 139553.16, 137962.62, 'California'],
[75328.87, 144135.98, 134050.07, 'Florida'],
[72107.6, 127864.55, 353183.81, 'New York'],
[66051.52, 182645.56, 118148.2, 'Florida'],
[65605.48, 153032.06, 107138.38, 'New York'],
[61994.48, 115641.28, 91131.24, 'Florida'],
[61136.38, 152701.92, 88218.23, 'New York'],
[63408.86, 129219.61, 46085.25, 'California'],
[55493.95, 103057.49, 214634.81, 'Florida'],
[46426.07, 157693.92, 210797.67, 'California'],
[46014.02, 85047.44, 205517.64, 'New York'],
[28663.76, 127056.21, 201126.82, 'Florida'],
[44069.95, 51283.14, 197029.42, 'California'],
[20229.59, 65947.93, 185265.1, 'New York'],
[38558.51, 82982.09, 174999.3, 'California'],
[28754.33, 118546.05, 172795.67, 'California'],
[27892.92, 84710.77, 164470.71, 'Florida'],
[23640.93, 96189.63, 148001.11, 'California'],
[15505.73, 127382.3, 35534.17, 'New York'],
[22177.74, 154806.14, 28334.72, 'California'],
[1000.23, 124153.04, 1903.93, 'New York'],
[1315.46, 115816.21, 297114.46, 'Florida'],
[0.0, 135426.92, 0.0, 'California'],
[542.05, 51743.15, 0.0, 'New York'],
[0.0, 116983.8, 45173.06, 'California']], dtype=object)

As we can see in the above output, the last column contains categorical variables which are
not suitable to apply directly for fitting the model. So we need to encode this variable.

Encoding Dummy Variables:

As we have one categorical variable (State), which cannot be directly applied to the model,
we will encode it. To encode the categorical variable into numbers, we will use
the LabelEncoder class. But this is not sufficient, because the numeric codes still imply a
relational order, which may create a wrong model. So in order to remove this problem, we will
use OneHotEncoder, which will create the dummy variables. Below is the code for it:

#Catgorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_x= LabelEncoder()
x[:, 3]= labelencoder_x.fit_transform(x[:,3])
onehotencoder= OneHotEncoder(categorical_features= [3])
x= onehotencoder.fit_transform(x).toarray()
Here we are only encoding one independent variable, State, as the other variables are continuous.
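As an aside, pandas offers a shorter route to the same dummy variables (a sketch assuming the data_set DataFrame loaded above; drop_first=True also drops one dummy column, which relates to the dummy variable trap discussed below):

# One-hot encode the State column directly from the DataFrame
x_df = pd.get_dummies(data_set.iloc[:, :-1], columns=['State'], drop_first=True)
x = x_df.values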

Output:

As we can see in the above output, the State column has been converted into dummy variables (0 and
1), where each dummy variable column corresponds to one State. We can check this by comparing
with the original dataset: the first column corresponds to the California State, the second column
corresponds to the Florida State, and the third column corresponds to the New York State.

o Now, we are writing a single line of code just to avoid the dummy variable trap:

#avoiding the dummy variable trap:


x = x[:, 1:]
If we do not remove the first dummy variable, then it may introduce multicollinearity in the model.

As we can see in the above output image, the first column has been removed.
o Now we will split the dataset into training and test set. The code for this is given
below:

# Splitting the dataset into training and test set.


from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

The above code will split our dataset into a training set and test set.

Output: The above code will split the dataset into training set and test set. You can check the
output by clicking on the variable explorer option given in Spyder IDE. The test set and
training set will look like the below image:

Test set:

Training set:

Step: 2- Fitting our MLR model to the Training set:

Now, we have prepared our dataset well in order to provide training, which means we will fit
our regression model to the training set. It will be similar to what we did in the Simple Linear
Regression model. The code for this will be:

#Fitting the MLR model to the training set:


from sklearn.linear_model import LinearRegression
regressor= LinearRegression()
regressor.fit(x_train, y_train)

Output:

Out[9]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,


normalize=False)

Now, we have successfully trained our model using the training dataset. In the next step, we
will test the performance of the model using the test dataset.

Step: 3- Prediction of Test set results:

The last step for our model is checking the performance of the model. We will do it by
predicting the test set result. For prediction, we will create a y_pred vector. Below is the
code for it:

#Predicting the Test set result;


y_pred= regressor.predict(x_test)

By executing the above lines of code, a new vector will be generated under the variable
explorer option. We can test our model by comparing the predicted values and test set values.

Output:

In the above output, we have the predicted result set and the test set. We can check model
performance by comparing these two values index by index. For example, the first index has a
predicted value of $103,015 profit and a test/real value of $103,282 profit. The difference is
only $267, which is a good prediction, so our model is complete here.

o We can also check the score for training dataset and test dataset. Below is the code for
it:

print('Train Score: ', regressor.score(x_train, y_train))


print('Test Score: ', regressor.score(x_test, y_test))

Output: The score is:

Train Score: 0.9501847627493607


Test Score: 0.9347068473282446
Applications of Multiple Linear Regression:

There are mainly two applications of Multiple Linear Regression:

o Effectiveness of an independent variable on prediction: estimating how strongly each
independent variable affects the dependent variable.
o Predicting the impact of changes: understanding how much the dependent variable
changes when an independent variable is changed.

What is Backward Elimination?
Backward elimination is a feature selection technique while building a machine learning
model. It is used to remove those features that do not have a significant effect on the
dependent variable or prediction of output. There are various ways to build a model in
Machine Learning, which are:

1. All-in
2. Backward Elimination
3. Forward Selection
4. Bidirectional Elimination
5. Score Comparison

Above are the possible methods for building a model in machine learning, but here we will only
use the Backward Elimination process, as it is the fastest method.

Steps of Backward Elimination

Below are some main steps which are used to apply backward elimination process:

Step-1: First, select a significance level to stay in the model (SL = 0.05).

Step-2: Fit the complete model with all possible predictors/independent variables.

Step-3: Choose the predictor which has the highest P-value, such that:

a. If P-value > SL, go to step 4.
b. Else finish; our model is ready.

Step-4: Remove that predictor.

Step-5: Rebuild and fit the model with the remaining variables.
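These steps can also be expressed as a small loop; a sketch, assuming x already includes the constant column added later in this chapter (note it treats every column, including the constant, as removable):

import numpy as np
import statsmodels.api as sm

def backward_elimination(x, y, sl=0.05):
    x_opt = x.astype(float)
    while True:
        model = sm.OLS(endog=y, exog=x_opt).fit()
        p_values = model.pvalues
        if p_values.max() <= sl:
            return x_opt, model               # every remaining predictor is significant
        worst = int(np.argmax(p_values))      # predictor with the highest p-value
        x_opt = np.delete(x_opt, worst, axis=1)   # remove it and refit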

Need for Backward Elimination: An optimal Multiple Linear Regression model:

In the previous chapter, we discussed and successfully created our Multiple Linear
Regression model, where we took 4 independent variables (R&D spend, Administration
spend, Marketing spend, and State (dummy variables)) and one dependent variable
(Profit). But that model is not optimal, as we have included all the independent variables and
do not know which independent variable affects the prediction most and which one affects it
least.

Unnecessary features increase the complexity of the model. Hence it is good to have only the
most significant features and keep our model simple to get the better result.

So, to optimize the performance of the model, we will use the Backward Elimination method. This
process is used to optimize the performance of the MLR model as it will only include the most
affecting feature and remove the least affecting feature. Let us start to apply it to our MLR model.

Steps for Backward Elimination method:

We will use the same model which we build in the previous chapter of MLR. Below is the
complete code for it:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('50_CompList.csv')

#Extracting Independent and dependent Variable


x= data_set.iloc[:, :-1].values
y= data_set.iloc[:, 4].values

#Catgorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_x= LabelEncoder()
x[:, 3]= labelencoder_x.fit_transform(x[:,3])
onehotencoder= OneHotEncoder(categorical_features= [3])
x= onehotencoder.fit_transform(x).toarray()

#Avoiding the dummy variable trap:
x = x[:, 1:]

# Splitting the dataset into training and test set.


from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

#Fitting the MLR model to the training set:


from sklearn.linear_model import LinearRegression
regressor= LinearRegression()
regressor.fit(x_train, y_train)
#Predicting the Test set result;
y_pred= regressor.predict(x_test)

#Checking the score


print('Train Score: ', regressor.score(x_train, y_train))
print('Test Score: ', regressor.score(x_test, y_test))

From the above code, we got training and test set result as:

Train Score: 0.9501847627493607


Test Score: 0.9347068473282446

The difference between both scores is 0.0154.

Step: 1- Preparation of Backward Elimination:

o Importing the library: Firstly, we need to import
the statsmodels.api library, which is used for the estimation of various
statistical models such as OLS (Ordinary Least Squares). Below is the code for it:

import statsmodels.api as sm


o Adding a column to the matrix of features: As we can check in our MLR equation (a),
there is one constant term b0, but this term is not present in our matrix of features, so
we need to add it manually. We will add a column having values x0 = 1, associated
with the constant term b0. To add this, we will use the append function of the NumPy library
(nm, which we have already imported into our code) and assign a value of 1.
Below is the code for it.

x = nm.append(arr = nm.ones((50,1)).astype(int), values=x, axis=1)
Here we have used axis =1, as we wanted to add a column. For adding a row, we can use axis
=0.
Output: By executing the above line of code, a new column will be added into our matrix of
features, which will have all values equal to 1. We can check it by clicking on the x dataset
under the variable explorer option.

As we can see in the above output image, the first column is added successfully, which
corresponds to the constant term of the MLR equation.
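For reference, statsmodels ships a helper that does the same thing (a sketch; add_constant prepends the column of ones by default):

x = sm.add_constant(x)   # equivalent to putting a column of ones in front of x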

Step: 2:

o Now, we are actually going to apply the backward elimination process. Firstly, we will
create a new feature vector x_opt, which will only contain the set of independent
features that significantly affect the dependent variable.
o Next, as per the Backward Elimination process, we need to choose a significance
level (SL = 0.05), and then fit the model with all possible predictors. So, for fitting the
model, we will create a regressor_OLS object of the
OLS class of the statsmodels library. Then we will fit it by using the fit() method.

o Next, we need the p-values to compare against the SL, so we will use
the summary() method to get a summary table of all the values. Below is the code
for it:

x_opt = x[:, [0,1,2,3,4,5]]
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary()
Output: By executing the above lines of code, we will get a summary table. Consider the below
image:

In the above image, we can clearly see the p-values of all the variables. Here x1, x2 are
dummy variables, x3 is R&D spend, x4 is Administration spend, and x5 is Marketing
spend.
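
As an aside, we do not have to read these values off the summary table by eye; the fitted results object exposes them directly through its standard pvalues attribute:

# p-values of the predictors, in the same order as the columns of x_opt
print(regressor_OLS.pvalues)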

From the table, we will choose the highest p-value, which is 0.953 for x1. Since this p-value
is greater than the SL value (0.05), we will remove the x1 variable (a dummy variable) from
the matrix and refit the model. Below is the code for it:

x_opt = x[:, [0,2,3,4,5]]
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary()

Output:

As we can see in the output image, five variables now remain. Among these, the highest
p-value is 0.961, which belongs to the x1 variable (the other dummy variable). Since it is
greater than the SL, we will remove it and refit the model. Below is the code for it:

x_opt = x[:, [0,3,4,5]]
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary()


Output:

In the above output image, we can see that the second dummy variable has been removed. The
next highest p-value is 0.602, which is still greater than 0.05, so we need to remove that
variable too.

o Now we will remove Administration spend, which has the 0.602 p-value, and again refit
the model.

x_opt = x[:, [0,3,5]]
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary()

Output:

As we can see in the above output image, the Administration spend variable has been removed.
But there is still one candidate variable left, Marketing spend, as its p-value (0.060) is
above the significance level. So we need to remove it too.

o Finally, we will remove the Marketing spend variable, whose 0.060 p-value is more
than the significance level.
Below is the code for it:

x_opt = x[:, [0,3]]
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary()

Output:

As we can see in the above output image, only two variables are left: the constant and R&D
spend. So R&D spend is the only independent variable that is significant for the prediction,
and we can now predict efficiently using it.
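
The manual iterations above can also be automated. Below is a minimal sketch of the same elimination loop, assuming the x (with the constant column), y, and the statsmodels import from earlier; the hypothetical backward_elimination() helper keeps dropping the feature with the highest p-value until every remaining p-value is at or below the significance level:

def backward_elimination(x, y, sl = 0.05):
    cols = list(range(x.shape[1]))   # start with all columns; column 0 is the constant
    while True:
        model = sm.OLS(endog = y, exog = x[:, cols]).fit()
        worst = int(model.pvalues.argmax())   # position of the highest p-value
        if model.pvalues[worst] <= sl:
            break   # every remaining feature is significant
        del cols[worst]   # drop the least significant feature and refit
    return cols, model

kept, final_model = backward_elimination(x, y)
print('Kept columns: ', kept)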

Estimating the performance:

In the previous topic, we calculated the train and test scores of the model using all the
feature variables. Now we will check the score with only one feature variable
(R&D spend). Our dataset now looks like:

Below is the code for building the Multiple Linear Regression model using only R&D
spend:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('50_CompList1.csv')

#Extracting Independent and dependent Variable


x_BE= data_set.iloc[:, :-1].values
y_BE= data_set.iloc[:, 1].values

# Splitting the dataset into training and test set.


from sklearn.model_selection import train_test_split
x_BE_train, x_BE_test, y_BE_train, y_BE_test= train_test_split(x_BE, y_BE, test_size= 0.2, random_state=0)

#Fitting the MLR model to the training set:

from sklearn.linear_model import LinearRegression
regressor= LinearRegression()
regressor.fit(nm.array(x_BE_train).reshape(-1,1), y_BE_train)

#Predicting the Test set result:


y_pred= regressor.predict(x_BE_test)

#Checking the score


print('Train Score: ', regressor.score(x_BE_train, y_BE_train))
print('Test Score: ', regressor.score(x_BE_test, y_BE_test))

Output:

After executing the above code, we will get the Training and test scores as:

Train Score: 0.9449589778363044


Test Score: 0.9464587607787219

As we can see, the training score is about 94% accurate, and the test score is also about 94%
accurate. The difference between the two scores is 0.0015, which is even smaller than the
0.0154 difference we got previously, when all the variables were included.

ML Polynomial Regression:
o Polynomial Regression is a regression algorithm that models the relationship between
a dependent variable (y) and an independent variable (x) as an nth-degree polynomial. The
Polynomial Regression equation is given below:

y = b0 + b1x1 + b2x1^2 + b3x1^3 + ...... + bnx1^n


o It is also called a special case of Multiple Linear Regression in ML, because we add
some polynomial terms to the Multiple Linear Regression equation to convert it into
Polynomial Regression.
o It is a linear model with some modifications made in order to increase the accuracy.
o The dataset used in Polynomial regression for training is of non-linear nature.
o It makes use of a linear regression model to fit the complicated and non-linear
functions and datasets.
o Hence, "In Polynomial regression, the original features are converted into
Polynomial features of required degree (2,3,..,n) and then modeled using a linear
model."

Need for Polynomial Regression:

The need of Polynomial Regression in ML can be understood in the below points:

o If we apply a linear model to a linear dataset, it provides a good result, as we have
seen in Simple Linear Regression; but if we apply the same model, without any
modification, to a non-linear dataset, the output will be drastically worse. The loss
function will increase, the error rate will be high, and the accuracy will decrease.
o So for such cases, where data points are arranged in a non-linear fashion, we
need the Polynomial Regression model. We can understand this in a better way using
the below comparison diagram of a linear dataset and a non-linear dataset.

o In the above image, we have taken a dataset which is arranged non-linearly. So if we
try to cover it with a linear model, we can clearly see that it hardly covers any
data points. On the other hand, a curve is suitable for covering most of the data points,
which is what the Polynomial model provides.
o Hence, if a dataset is arranged in a non-linear fashion, we should use the
Polynomial Regression model instead of Simple Linear Regression.

Equation of the Polynomial Regression Model:

Simple Linear Regression equation: y = b0 + b1x .........(a)

Multiple Linear Regression equation: y = b0 + b1x1 + b2x2 + b3x3 + .... + bnxn .........(b)

Polynomial Regression equation: y = b0 + b1x + b2x^2 + b3x^3 + .... + bnx^n ..........(c)

When we compare the above three equations, we can clearly see that all three are polynomial
equations, differing only in the degree of their variables. The Simple and Multiple Linear
equations are polynomial equations of degree one, and the Polynomial Regression equation is a
linear equation of degree n. So if we add degrees to our linear equations, they are converted
into Polynomial Linear equations.

Implementation of Polynomial Regression using Python:

Here we will implement the Polynomial Regression using Python. We will understand it by
comparing Polynomial Regression model with the Simple Linear Regression model. So first,
let's understand the problem for which we are going to build the model.

Problem Description: There is a Human Resources company which is going to hire a new
candidate. The candidate has stated that his previous salary was 160K per annum, and HR has to
check whether he is telling the truth or bluffing. To identify this, they only have a dataset
from his previous company, in which the salaries of the top 10 positions are mentioned along
with their levels. On checking the available dataset, we have found that there is a non-linear
relationship between the position levels and the salaries. Our goal is to build a Bluffing
Detector regression model, so HR can hire an honest candidate. Below are the steps to build
such a model.

Steps for Polynomial Regression:

The main steps involved in Polynomial Regression are given below:

o Data Pre-processing
o Build a Linear Regression model and fit it to the dataset
o Build a Polynomial Regression model and fit it to the dataset
o Visualize the result for Linear Regression and Polynomial Regression model.
o Predicting the output.

Data Pre-processing Step:

o The data pre-processing step will remain the same as in previous regression models,
except for some changes. In the Polynomial Regression model, we will not use feature
scaling, and we will also not split our dataset into training and test sets. There are two
reasons for this:

o The dataset contains very few records, so dividing it into a test and training set
would leave the model unable to find the correlation between the salaries and the
levels.
o In this model, we want very accurate predictions for salary, so the model should have
all the information.

The code for pre-processing step is given below:

# importing libraries
import numpy as nm

import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('Position_Salaries.csv')

#Extracting Independent and dependent Variable


x= data_set.iloc[:, 1:2].values
y= data_set.iloc[:, 2].values

Explanation:

o In the above lines of code, we have imported the important Python libraries needed to
import the dataset and operate on it.
o Next, we have imported the dataset 'Position_Salaries.csv', which contains three
columns (Position, Levels, and Salary), but we will consider only two columns
(Levels and Salary).
o After that, we have extracted the dependent variable (y) and independent variable (x)
from the dataset. For the x variable, we have used the index [:, 1:2], because we want
column index 1 (Levels), and the 1:2 slice keeps x as a matrix.

Output:

By executing the above code, we can read our dataset as:

As we can see in the above output, there are three columns present (Positions, Levels, and
Salaries). But we are only considering two columns, because the Positions column is
equivalent to the Levels column; the levels can be seen as the encoded form of the positions.

Here we will predict the output for level 6.5, because the candidate has 4+ years' experience
as a regional manager, so he must be somewhere between levels 6 and 7.

Building the Linear regression model:

Now, we will build and fit the Linear regression model to the dataset. In building polynomial
regression, we will take the Linear regression model as reference and compare both the results. The
code is given below:

#Fitting the Linear Regression to the dataset


from sklearn.linear_model import LinearRegression
lin_regs= LinearRegression()
lin_regs.fit(x,y)

In the above code, we have created the Simple Linear model using lin_regs object
of LinearRegression class and fitted it to the dataset variables (x and y).

Output:

Out[5]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Building the Polynomial regression model:

Now we will build the Polynomial Regression model, but it will be a little different from the
Simple Linear model, because here we will use the PolynomialFeatures class of
the sklearn.preprocessing module. We are using this class to add extra (polynomial) feature
columns to our dataset.

#Fitting the Polynomial regression to the dataset


from sklearn.preprocessing import PolynomialFeatures
poly_regs= PolynomialFeatures(degree= 2)
x_poly= poly_regs.fit_transform(x)
lin_reg_2 =LinearRegression()
lin_reg_2.fit(x_poly, y)

In the above lines of code, we have used poly_regs.fit_transform(x), because first we are
converting our feature matrix into a polynomial feature matrix, and then fitting it to the
Polynomial Regression model. The degree parameter (degree= 2 here) is our choice; we can set
it according to how flexible we want the polynomial features to be.
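
To make this transformation concrete, here is a small illustration with two hypothetical level values: for degree= 2 and a single input column, each value x is expanded into the row [1, x, x^2].

demo = nm.array([[2], [3]])
print(PolynomialFeatures(degree= 2).fit_transform(demo))
# [[1. 2. 4.]
#  [1. 3. 9.]]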

After executing the code, we will get another matrix x_poly, which can be seen under the
variable explorer option:

Next, we have used another LinearRegression object, namely lin_reg_2, to fit
our x_poly vector to the linear model.

Output:

Out[11]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Visualizing the result for Linear regression:

Now we will visualize the result for Linear regression model as we did in Simple Linear
Regression. Below is the code for it:

#Visualizing the result for Linear Regression model


mtp.scatter(x,y,color="blue")
mtp.plot(x,lin_regs.predict(x), color="red")
mtp.title("Bluff detection model(Linear Regression)")
mtp.xlabel("Position Levels")
mtp.ylabel("Salary")
mtp.show()

Output:

In the above output image, we can clearly see that the regression line is quite far from the
data points. The predictions are the red straight line, and the blue points are the actual
values. If we used this output to predict the salary of the CEO, it would give a salary of
approx. $600,000, which is far away from the real value.

So we need a curved model to fit the dataset, rather than a straight line.

Visualizing the result for Polynomial Regression:

Here we will visualize the result of the Polynomial Regression model, the code for which is a
little different from the above model.

Code for this is given below:

#Visualizing the result for Polynomial Regression


mtp.scatter(x,y,color="blue")
mtp.plot(x, lin_reg_2.predict(poly_regs.fit_transform(x)), color="red")
mtp.title("Bluff detection model(Polynomial Regression)")
mtp.xlabel("Position Levels")
mtp.ylabel("Salary")
mtp.show()
In the above code, we have used lin_reg_2.predict(poly_regs.fit_transform(x)) instead of
x_poly, because we want the linear regressor object to predict over the polynomial feature matrix.

Output:

As we can see in the above output image, the predictions are close to the real values. The
above plot will change as we vary the degree.

For degree= 3:

If we change the degree to 3, we get a more accurate plot, as shown in the below
image.

So, as we can see in the above output image, the predicted salary for level 6.5 is near
$170K-$190K, which suggests that the future employee is telling the truth about his salary.

Degree= 4: Let's again change the degree to 4, and now we will get the most accurate plot.
Hence, we can get more accurate results by increasing the degree of the polynomial.
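
As a quick way to see this effect numerically rather than visually, here is a minimal sketch (reusing x, y, and the imports from above, plus the hypothetical 6.5 query level) that refits the model for several degrees and prints each prediction:

for degree in (2, 3, 4):
    poly = PolynomialFeatures(degree= degree)
    model = LinearRegression()
    model.fit(poly.fit_transform(x), y)   # refit on the expanded features
    print(degree, model.predict(poly.fit_transform([[6.5]])))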

Predicting the final result with the Linear Regression model:

Now, we will predict the final output using the Linear Regression model to see whether the
employee is telling the truth or bluffing. For this, we will use the predict() method and pass
the value 6.5. Below is the code for it:

lin_pred = lin_regs.predict([[6.5]])
print(lin_pred)

Output:

[330378.78787879]

Predicting the final result with the Polynomial Regression model:

Now, we will predict the final output using the Polynomial Regression model, to compare it
with the Linear model. Below is the code for it:

poly_pred = lin_reg_2.predict(poly_regs.fit_transform([[6.5]]))
print(poly_pred)

Output:

[158862.45265153]

As we can see, the predicted output of the Polynomial Regression is [158862.45265153],
which is much closer to the real value; hence, we can say that the future employee is telling the truth.
