MACHINE LEARNING
Students and professionals in the workforce can benefit from our machine-learning tutorial.
Machine learning, a rapidly developing field of technology, allows computers to learn automatically from previous data. To build mathematical models and make predictions based on historical data or information, machine learning employs a variety of algorithms. It is currently used for various tasks, including speech recognition, email filtering, auto-tagging on Facebook, recommender systems, and image recognition.
You will learn about the many different methods of machine learning, including
reinforcement learning, supervised learning, and unsupervised learning, in this machine
learning tutorial. Regression and classification models, clustering techniques, hidden Markov
models, and various sequential models will all be covered.
Without being explicitly programmed, machine learning enables a machine to automatically
learn from data, improve performance from experiences, and predict things.
Machine learning algorithms create a mathematical model that, without being explicitly
programmed, aids in making predictions or decisions with the assistance of sample historical
data, or training data. To develop predictive models, machine learning brings together
statistics and computer science. Algorithms that learn from historical data are either
constructed or utilized in machine learning. The performance will rise in proportion to the
quantity of information we provide.
A machine can learn if it can gain more data to improve its performance.
Let's say we have a complex problem in which we need to make predictions. Instead of
writing code, we just need to feed the data to generic algorithms, which build the logic based
on the data and predict the output. Our perspective on the issue has changed because of
machine learning. The Machine Learning algorithm's operation is depicted in the following
block diagram:
We can train machine learning algorithms by providing them with a large amount of data and letting them automatically explore the data, build models, and predict the required output. The performance of a machine learning algorithm depends on the amount of data, and it can be measured by a cost function. We can save both time and money by using machine learning.
The importance of machine learning can be easily understood from its use cases. Currently, machine learning is used in self-driving vehicles, cyber fraud detection, face recognition, friend suggestions on Facebook, and so on. Various top companies such as Netflix and Amazon have built machine learning models that use a vast amount of data to analyze user interests and recommend products accordingly.
Following are some key points which show the importance of Machine Learning:
Rapid increase in the production of data
Solving complex problems, which are difficult for a human
Decision-making in various sectors including finance
Finding hidden patterns and extracting useful information from data.
Machine learning can be classified into three main types:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
Supervised Learning
In supervised learning, sample-labelled data are provided to the machine learning system for
training, and the system then predicts the output based on the training data.
The system uses labelled data to build a model that understands the datasets and learns about
each one. After the training and processing are done, we test the model with sample data to
see if it can accurately predict the output.
The objective of supervised learning is to map input data to output data. Supervised learning is based on supervision; it is comparable to a student learning things under the supervision of a teacher. Spam filtering is an example of supervised learning.
Supervised learning can be grouped further into two categories of algorithms:
Classification
Regression
Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any
supervision.
The training is provided to the machine with a set of data that has not been labeled, classified,
or categorized, and the algorithm needs to act on that data without any supervision. The goal
of unsupervised learning is to restructure the input data into new features or a group of
objects with similar patterns.
In unsupervised learning, we don't have a predetermined result. The machine tries to find
useful insights from a huge amount of data. It can be further classified into two categories of
algorithms:
Clustering
Association
Reinforcement Learning
Reinforcement learning is a feedback-based learning method in which an agent learns by performing actions and receiving rewards for good actions and penalties for bad ones. A robotic dog that automatically learns the movement of its limbs is an example of reinforcement learning.
History of Machine Learning
Machine learning has a long history. Below are some milestones that have occurred in the history of machine learning:
1952: Arthur Samuel, who was a pioneer of machine learning, created a program that helped an IBM computer to play checkers. It performed better the more it played.
1959: In 1959, the term "Machine Learning" was first coined by Arthur Samuel.
The first "AI" winter:
The period from 1974 to 1980 was a tough time for AI and ML researchers, and this period was called the AI winter.
During this period, machine translation failed and people lost interest in AI, which led to reduced government funding for research.
Machine Learning from theory to reality
1959: In 1959, the first neural network was applied to a real-world problem to remove
echoes over phone lines using an adaptive filter.
1985: In 1985, Terry Sejnowski and Charles Rosenberg invented a neural network
NETtalk, which was able to teach itself how to correctly pronounce 20,000 words in
one week.
1997: IBM's Deep Blue intelligent computer won a chess match against the world chess champion Garry Kasparov, becoming the first computer to beat a reigning world chess champion.
Machine Learning in the 21st Century
2006:
Geoffrey Hinton and his group presented the idea of deep learning using deep belief networks.
The Elastic Compute Cloud (EC2) was launched by Amazon to provide scalable
computing resources that made it easier to create and implement machine learning
models.
2007:
The Netflix Prize competition began, tasking participants with improving the accuracy of Netflix's recommendation algorithm.
Reinforcement learning made notable progress when a group of researchers used it to train a computer to play backgammon at a high level.
2008:
Google released the Google Prediction API, a cloud-based service that allowed developers to integrate machine learning into their applications.
Restricted Boltzmann Machines (RBMs), a kind of generative neural network, gained attention for their ability to model complex data distributions.
2009:
Deep learning gained ground as researchers demonstrated its effectiveness in various tasks, including speech recognition and image classification.
The term "Big Data" gained popularity, highlighting the challenges and opportunities associated with handling huge datasets.
2010:
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was introduced, driving advances in computer vision and prompting the development of deep convolutional neural networks (CNNs).
2011:
IBM's Watson defeated human champions on Jeopardy!, demonstrating the potential of question-answering systems and natural language processing.
2012:
AlexNet, a deep CNN created by Alex Krizhevsky, won the ILSVRC, significantly improving image classification accuracy and establishing deep learning as a dominant approach in computer vision.
Google's Brain project, led by Andrew Ng and Jeff Dean, used deep learning to train a neural network to recognize cats from unlabeled YouTube videos.
2013:
Ian Goodfellow introduced generative adversarial networks (GANs), which made it
possible to create realistic synthetic data.
Google later acquired the startup DeepMind Technologies, which focused on deep
learning and artificial intelligence.
2014:
Facebook introduced the DeepFace system, which achieved near-human accuracy in facial recognition.
AlphaGo, a program created by Google's DeepMind, later defeated world-champion Go players and demonstrated the potential of reinforcement learning in challenging games.
2015:
Microsoft released the Cognitive Toolkit (previously known as CNTK), an open-source deep learning library.
The performance of sequence-to-sequence models in tasks like machine translation was enhanced by the introduction of attention mechanisms.
2016:
The goal of explainable AI, which focuses on making machine learning models easier
to understand, received some attention.
Google's DeepMind created AlphaGo Zero, which achieved superhuman Go-playing ability without human game data, using only reinforcement learning.
2017:
Transfer learning gained prominence, allowing pre-trained models to be reused for different tasks with limited data.
Better synthesis and generation of complex data were made possible by the
introduction of generative models like variational autoencoders (VAEs) and
Wasserstein GANs.
These are only some of the notable advances and milestones in machine learning during this period. The field continued to evolve rapidly beyond 2017, with new breakthroughs, techniques, and applications emerging.
Machine Learning at present:
The field of machine learning has made significant strides in recent years, and its applications are numerous, including self-driving cars, Amazon Alexa, chatbots, and recommender systems. It incorporates supervised and unsupervised learning, along with clustering, classification, decision trees, SVM algorithms, and reinforcement learning.
Modern machine learning models can be used for making various predictions, including weather prediction, disease prediction, stock market analysis, and so on.
Applications of Machine learning
Machine learning is a buzzword in today's technology, and it is growing very rapidly day by day. We use machine learning in our daily lives, often without knowing it, in tools such as Google Maps, Google Assistant, Alexa, etc. Below are some of the most trending real-world applications of machine learning:
1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is used to
identify objects, persons, places, digital images, etc. A popular use case of image recognition and face detection is automatic friend tagging suggestions:
It is based on the Facebook project named "Deep Face," which is responsible for face
recognition and person identification in the picture.
2. Speech Recognition
While using Google, we get an option of "Search by voice," it comes under speech
recognition, and it's a popular application of machine learning.
Speech recognition is a process of converting voice instructions into text, and it is also known
as "Speech to text", or "Computer speech recognition." At present, machine learning
algorithms are widely used in various speech recognition applications. Google Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow voice instructions.
3. Traffic prediction:
If we want to visit a new place, we take help of Google Maps, which shows us the correct
path with the shortest route and predicts the traffic conditions.
It predicts the traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested, in two ways:
o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time.
Everyone who uses Google Maps is helping to make the app better. It takes information from the user and sends it back to its database to improve performance.
4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment companies such
as Amazon, Netflix, etc., for product recommendations to the user. Whenever we search for a product on Amazon, we start getting advertisements for the same product while browsing the internet in the same browser, and this is because of machine learning.
Google understands the user's interest using various machine learning algorithms and suggests products as per that interest.
Similarly, when we use Netflix, we find recommendations for entertainment series, movies, etc., and this is also done with the help of machine learning.
5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars. Machine
learning plays a significant role in self-driving cars. Tesla, the most popular car
manufacturing company, is working on self-driving cars. It uses unsupervised learning methods to train the car models to detect people and objects while driving.
6. Email Spam and Malware Filtering:
Whenever we receive a new email, it is automatically filtered as important, normal, or spam, and machine learning is the technology behind this filtering. Below are some common types of spam filters:
o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters
7. Virtual Personal Assistant:
We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri.
As the name suggests, they help us in finding the information using our voice instruction.
These assistants can help us in various ways just by our voice instructions such as Play
music, call someone, Open an email, Scheduling an appointment, etc.
These assistants record our voice instructions, send them to a server on the cloud, decode them using ML algorithms, and act accordingly.
8. Online Fraud Detection:
Machine learning is making our online transactions safe and secure by detecting fraudulent transactions. Whenever we perform an online transaction, there are various ways a fraudulent transaction can take place, such as fake accounts, fake IDs, and stealing money in the middle of a transaction. To detect this, a feed-forward neural network helps us by checking whether a transaction is genuine or fraudulent.
For each genuine transaction, the output is converted into hash values, and these values become the input for the next round. For each genuine transaction, there is a specific pattern which changes for a fraudulent transaction; hence, the system detects it and makes our online transactions more secure.
9. Stock Market Trading:
Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups and downs in share prices, so machine learning's long short-term memory (LSTM) neural networks are used for the prediction of stock market trends.
10. Medical Diagnosis:
In medical science, machine learning is used for disease diagnosis. With this, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain.
11. Automatic Language Translation:
Nowadays, if we visit a new place and are not aware of its language, it is not a problem at all; machine learning helps us by converting the text into languages we know. Google's GNMT (Google Neural Machine Translation) provides this feature; it is a neural machine translation system that translates text into our familiar language, and this is known as automatic translation.
Machine Learning Life Cycle
The machine learning life cycle involves seven major steps, which are given below:
o Gathering Data
o Data preparation
o Data Wrangling
o Analyse Data
o Train the model
o Test the model
o Deployment
The most important thing in the complete process is to understand the problem and to know its purpose. Therefore, before starting the life cycle, we need to understand the problem, because a good result depends on a good understanding of the problem.
In the complete life cycle process, to solve a problem, we create a machine learning system
called "model", and this model is created by providing "training". But to train a model, we
need data; hence, the life cycle starts with collecting data.
1. Gathering Data:
Data gathering is the first step of the machine learning life cycle. The goal of this step is to identify the data needed for the problem and obtain it.
In this step, we need to identify the different data sources, as data can be collected from various sources such as files, databases, the internet, or mobile devices. It is one of the most important steps of the life cycle. The quantity and quality of the collected data will determine the efficiency of the output: the more data we have, the more accurate the prediction will be.
This step includes the below tasks:
Identify various data sources
Collect data
Integrate the data obtained from different sources
By performing the above tasks, we get a coherent set of data, also called a dataset. It will be
used in further steps.
2. Data Preparation
After collecting the data, we need to prepare it for further steps. Data preparation is a step
where we put our data into a suitable place and prepare it to use in our machine learning
training.
In this step, first, we put all data together, and then randomize the ordering of data.
This step can be further divided into two processes:
Data exploration:
It is used to understand the nature of the data that we have to work with. We need to
understand the characteristics, format, and quality of data.
A better understanding of data leads to an effective outcome. In this, we find Correlations,
general trends, and outliers.
Data pre-processing:
Now the next step is preprocessing of data for its analysis.
3. Data Wrangling
Data wrangling is the process of cleaning and converting raw data into a usable format. It is
the process of cleaning the data, selecting the variable to use, and transforming the data in a
proper format to make it more suitable for analysis in the next step. It is one of the most
important steps of the complete process. Cleaning of data is required to address the quality
issues.
The data we have collected is not always directly usable, as some of it may not be useful. In real-world applications, collected data may have various issues, including:
Missing Values
Duplicate data
Invalid data
Noise
So, we use various filtering techniques to clean the data.
It is mandatory to detect and remove the above issues, because they can negatively affect the quality of the outcome.
4. Data Analysis
Now the cleaned and prepared data is passed on to the analysis step. This step involves:
Selection of analytical techniques
Building models
Review the result
The aim of this step is to build a machine learning model to analyze the data using various analytical techniques and review the outcome. It starts with determining the type of problem, where we select machine learning techniques such as classification, regression, cluster analysis, association, etc.; we then build the model using the prepared data and evaluate it.
Hence, in this step, we take the data and use machine learning algorithms to build the model.
5. Train Model
Now the next step is to train the model. In this step, we train our model to improve its performance and get a better outcome for the problem.
We use datasets to train the model using various machine learning algorithms. Training a model is required so that it can understand the various patterns, rules, and features.
6. Test Model
Once our machine learning model has been trained on a given dataset, we test the model. In this step, we check the accuracy of our model by providing a test dataset to it.
Testing the model determines the percentage accuracy of the model as per the requirements of the project or problem.
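As a small illustrative sketch (the labels below are assumed values, not from a real project), the accuracy of a trained model on a test dataset can be computed with scikit-learn:

from sklearn.metrics import accuracy_score

# Assumed true labels of the test set and the model's predictions for it
y_test = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]

# Fraction of test samples the model predicted correctly
print(accuracy_score(y_test, y_pred))   # 0.8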
7. Deployment
The last step of machine learning life cycle is deployment, where we deploy the model in the
real-world system.
If the above-prepared model is producing an accurate result as per our requirement with
acceptable speed, then we deploy the model in the real system. But before deploying the
project, we will check whether it is improving its performance using available data or not.
The deployment phase is like making the final report for a project.
Difference between Artificial Intelligence and Machine Learning
Although these are two related technologies and people sometimes use them as synonyms for each other, they are still two different terms in various cases.
“AI is a bigger concept to create intelligent machines that can simulate human thinking
capability and behaviour, whereas, machine learning is an application or subset of AI that
allows machines to learn from data without being programmed explicitly.”
Below are some main differences between AI and machine learning along with the overview
of Artificial intelligence and machine learning.
Artificial Intelligence
Artificial intelligence is a field of computer science which makes a computer system that can
mimic human intelligence. It is comprised of two words "Artificial" and "intelligence", which
means "a human-made thinking power." Hence, we can define it as,
“Artificial intelligence is a technology using which we can create intelligent systems that can
simulate human intelligence.”
An artificial intelligence system does not need to be pre-programmed; instead, it uses algorithms that can work with their own intelligence. It involves machine learning methods such as reinforcement learning algorithms and deep learning neural networks. AI is used in many places, such as Siri, Google's AlphaGo, AI in chess playing, etc.
Based on its capabilities, AI can be divided into three types:
Weak AI
General AI
Strong AI
Currently, we are working with weak AI and general AI. The future of AI is strong AI, which is said to be more intelligent than humans.
Machine learning
Machine learning is about extracting knowledge from the data. It can be defined as,
Machine learning is a subfield of artificial intelligence, which enables machines to learn from
past data or experiences without being explicitly programmed.
Machine learning enables a computer system to make predictions or take some decisions
using historical data without being explicitly programmed. Machine learning uses a massive
amount of structured and semi-structured data so that a machine learning model can generate
accurate result or give predictions based on that data.
Machine learning works on algorithms that learn on their own using historical data. It works only for specific domains: for example, if we create a machine learning model to detect pictures of dogs, it will only give results for dog images, but if we provide new data, such as a cat image, it will become unresponsive. Machine learning is used in various places, such as online recommender systems, Google search algorithms, email spam filters, Facebook auto friend tagging suggestions, etc.
Machine learning can be divided into three types:
Supervised learning
Reinforcement learning
Unsupervised learning
Artificial Intelligence | Machine Learning
The goal of AI is to make a smart computer system like humans to solve complex problems. | The goal of ML is to allow machines to learn from data so that they can give accurate output.
In AI, we make intelligent systems to perform any task like a human. | In ML, we teach machines with data to perform a particular task and give an accurate result.
Machine learning and deep learning are the two main subsets of AI. | Deep learning is a main subset of machine learning.
AI has a very wide range of scope. | Machine learning has a limited scope.
AI is working to create an intelligent system which can perform various complex tasks. | Machine learning is working to create machines that can perform only those specific tasks for which they are trained.
The main applications of AI are Siri, customer support using chatbots, expert systems, online game playing, intelligent humanoid robots, etc. | The main applications of machine learning are online recommender systems, Google search algorithms, Facebook auto friend tagging suggestions, etc.
On the basis of capabilities, AI can be divided into three types: Weak AI, General AI, and Strong AI. | Machine learning can also be divided into mainly three types: Supervised learning, Unsupervised learning, and Reinforcement learning.
It includes learning, reasoning, and self-correction. | It includes learning and self-correction when introduced with new data.
AI completely deals with structured, semi-structured, and unstructured data. | Machine learning deals with structured and semi-structured data.
How to get datasets for Machine Learning
The field of machine learning depends heavily on datasets for training models and making accurate predictions. Datasets play a vital role in the success of machine learning projects and are fundamental for becoming a skilled data scientist. In this section, we will explore the various types of datasets used in machine learning and provide a guide on where to find them.
What is a dataset?
A dataset is a collection of data in which data is arranged in some order. A dataset can
contain any data from a series of an array to a database table. Below table shows an example
of the dataset:
Country | Age | Salary | Purchased
India | 38 | 48000 | No
Germany | 30 | 54000 | No
France | 48 | 65000 | No
Germany | 40 | (missing) | Yes
A tabular dataset can be understood as a database table or matrix, where each column corresponds to a particular variable and each row corresponds to a record of the dataset. The most commonly used file type for a tabular dataset is the "Comma-Separated Values" file, or CSV. But to store "tree-like" data, we can use a JSON file more efficiently.
Types of datasets
Machine learning spans different domains, each requiring specific types of datasets. Some common types of datasets used in machine learning include:
Image Datasets:
Image datasets contain a collection of images and are commonly used in computer vision tasks such as image classification, object detection, and image segmentation.
Examples :
o ImageNet
o CIFAR-10
o MNIST
Text Datasets:
Text datasets consist of textual information, such as articles, books, or social media posts. These datasets are used in NLP tasks such as sentiment analysis, text classification, and machine translation.
Examples :
Time Series Datasets:
Time series datasets contain data points collected over time. They are commonly used in forecasting, anomaly detection, and trend analysis.
Examples :
Tabular Datasets:
Tabular datasets are structured data organized in tables or spreadsheets. They contain rows representing instances or samples and columns representing features or attributes.
Tabular datasets are used for tasks such as regression and classification. The dataset given earlier in this section is an example of a tabular dataset.
Need of Dataset
o Properly prepared and pre-processed datasets are important for machine learning projects.
o They provide the foundation for training accurate and reliable models. However, working with large datasets can present challenges in terms of management and processing.
o To address these challenges, efficient data management techniques and processing algorithms are required.
Data Pre-processing:
During the development of the ML project, the developers completely rely on the datasets. In
building ML applications, datasets are divided into two parts:
o Training dataset:
o Test Dataset
Note: The datasets are of large size, so to download these datasets, you must have fast
internet on your computer.
In machine learning, datasets are typically divided into two parts: the training dataset and the test dataset. The training dataset is used to train the machine learning model, while the test dataset is used to evaluate the model's performance. This division assesses the model's ability to generalize to unseen data. It is essential to ensure that the datasets are representative of the problem space and properly split to avoid bias or overfitting.
Below is a list of dataset sources that are freely available for the public to work with:
1. Kaggle Datasets
Kaggle is one of the best sources for providing datasets for Data Scientists and Machine
Learners. It allows users to find, download, and publish datasets in an easy way. It also
provides the opportunity to work with other machine learning engineers and solve difficult
Data Science related tasks.
Kaggle provides a high-quality dataset in different formats that we can easily find and
download.
2. UCI Machine Learning Repository
The UCI Machine Learning Repository is an important resource that has been widely used by researchers and practitioners since 1987. It contains a large collection of datasets sorted by machine learning tasks such as regression, classification, and clustering. Notable datasets in the repository include the Iris dataset, Car Evaluation dataset, and Poker Hand dataset.
The link for the UCI machine learning repository is https://archive.ics.uci.edu/ml/index.php.
3. Datasets via AWS
We can search, download, access, and share the datasets that are publicly available via AWS resources. These datasets are accessed through AWS resources but are provided and maintained by different government organizations, researchers, businesses, or individuals. Anyone can analyze and build various services using the shared data via AWS resources. Datasets shared on the cloud help users spend more time on data analysis rather than on data acquisition.
This source provides the various types of datasets with examples and ways to use the dataset.
It also provides the search box using which we can search for the required dataset. Anyone
can add any dataset or example to the Registry of Open Data on AWS.
4. Google Dataset Search
Google Dataset Search helps researchers find and access relevant datasets from various sources across the web. It indexes datasets from areas such as the social sciences, biology, and environmental science. Researchers can use keywords to find datasets, filter results based on specific criteria, and access the datasets directly from the source.
5. Microsoft Datasets
Microsoft has launched the "Microsoft Research Open Data" repository, a collection of free datasets in various areas such as natural language processing, computer vision, and domain-specific sciences. It gives access to diverse and curated datasets that can be valuable for machine learning projects.
6. Awesome Public Dataset Collection
Awesome public dataset collection provides high-quality datasets that are arranged in a well-
organized manner within a list according to topics such as Agriculture, Biology, Climate,
Complex networks, etc. Most of the datasets are available free, but some may not, so it is
better to check the license before downloading the dataset.
The link to download the dataset from Awesome public dataset collection
is https://github.com/awesomedata/awesome-public-datasets.
7. Government Datasets
There are different sources to get government-related data. Various countries publish
government data for public use collected by them from different departments.
The goal of providing these datasets is to increase transparency of government work among
the people and to use the data in an innovative approach. Below are some links of
government datasets:
8. Visual Data
Visual Data provides a number of great datasets that are specific to computer vision, such as image classification, video classification, and image segmentation. Therefore, if you want to build a project on deep learning or image processing, you can refer to this source.
The link for downloading the dataset from this source is https://www.visualdata.io/.
9. Scikit-learn dataset
The link to download datasets from this source
is https://scikit-learn.org/stable/datasets/index.html.
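Scikit-learn also bundles several small toy datasets that can be loaded directly in Python without downloading separate files. A minimal sketch, using the built-in Iris dataset as an example:

from sklearn.datasets import load_iris

# Load the Iris dataset that ships with scikit-learn
iris = load_iris()

print(iris.data.shape)        # (150, 4): 150 samples with 4 features each
print(iris.feature_names)     # names of the four measured features
print(iris.target[:5])        # class labels of the first five samples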
Data ethics and privacy are fundamental considerations in machine learning projects. It is essential to ensure that data is collected and used ethically, respecting privacy rights and complying with relevant laws and regulations. Data practitioners should take measures to protect data privacy, obtain proper consent, and handle sensitive data responsibly. Resources such as ethical guidelines and privacy frameworks can provide guidance on maintaining ethical practices in data collection and use.
Data Preprocessing in Machine Learning
When creating a machine learning project, we do not always come across clean and formatted data. And while doing any operation with data, it is mandatory to clean it and put it in a formatted way. For this, we use the data preprocessing task.
Real-world data generally contains noise and missing values, and it may be in an unusable format that cannot be directly used for machine learning models. Data preprocessing is the required task for cleaning the data and making it suitable for a machine learning model, which also increases the accuracy and efficiency of the model.
It involves the following steps:
Getting the dataset
Importing libraries
Importing datasets
Finding and handling missing data
Encoding categorical data
Splitting the dataset into training and test sets
Feature scaling
1) Get the Dataset
To create a machine learning model, the first thing we require is a dataset, as a machine learning model works entirely on data. The collected data for a particular problem in a proper format is known as the dataset.
Datasets may come in different formats for different purposes; for example, a dataset for a business problem will be different from the dataset required for a liver-patient problem. So each dataset is different from another dataset. To use the dataset in our code, we usually put it into a CSV file. However, sometimes we may also need to use an HTML or xlsx file.
CSV stands for "Comma-Separated Values"; it is a file format which allows us to save tabular data, such as spreadsheets. It is useful for huge datasets, and these datasets can easily be used in programs.
Here we will use a demo dataset for data preprocessing; for practice, it can be downloaded from "https://www.superdatascience.com/pages/machine-learning". For real-world problems, we can download datasets online from various sources such as https://www.kaggle.com/uciml/datasets, https://archive.ics.uci.edu/ml/index.php, etc.
We can also create our own dataset by gathering data using various APIs with Python and putting that data into a .csv file.
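As an illustrative sketch (the records below are assumed sample values, not a real gathered dataset), data collected in Python can be written to a .csv file with pandas like this:

import pandas as pd

# Assumed sample records, e.g. gathered from an API
records = [
    {"Country": "India", "Age": 38, "Salary": 48000, "Purchased": "No"},
    {"Country": "Germany", "Age": 30, "Salary": 54000, "Purchased": "No"},
]

# Build a DataFrame and save it as a CSV file for later use
pd.DataFrame(records).to_csv("Dataset.csv", index=False)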
2) Importing Libraries
In order to perform data preprocessing using Python, we need to import some predefined
Python libraries. These libraries are used to perform some specific jobs. There are three
specific libraries that we will use for data preprocessing, which are:
Numpy: The Numpy Python library is used for including any type of mathematical operation in the code. It is the fundamental package for scientific calculation in Python. It also supports large, multi-dimensional arrays and matrices. So, in Python, we can import it as:
1. import numpy as nm
Here we have used nm, which is a short name for Numpy, and it will be used in the whole
program.
Matplotlib: The second library is matplotlib, which is a Python 2D plotting library; with this library, we need to import a sub-library, pyplot. This library is used to plot any type of chart in Python. It will be imported as below:
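1. import matplotlib.pyplot as mtp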
Pandas: The last library is the Pandas library, which is one of the most famous Python libraries and is used for importing and managing datasets. It is an open-source data manipulation and analysis library. It will be imported as below:
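1. import pandas as pd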
Here, we have used pd as a short name for this library. Consider the below image:
3) Importing the Datasets
Now we need to import the datasets which we have collected for our machine learning project. But before importing a dataset, we need to set the current directory as the working directory. To set a working directory in the Spyder IDE, we need to follow the steps below:
Here, in the below image, we can see the Python file along with required dataset. Now, the
current folder is set as a working directory.
read_csv() function:
Now to import the dataset, we will use read_csv() function of pandas library, which is used to
read a csv file and performs various operations on it. Using this function, we can read a csv
file locally as well as through an URL.
1. data_set= pd.read_csv('Dataset.csv')
Here, data_set is a name of the variable to store our dataset, and inside the function, we have
passed the name of our dataset. Once we execute the above line of code, it will successfully
import the dataset in our code. We can also check the imported dataset by clicking on the
section variable explorer, and then double click on data_set. Consider the below image:
As in the above image, indexing is started from 0, which is the default indexing in Python.
We can also change the format of our dataset by clicking on the format option.
Extracting independent variables:
To extract the independent variables, we will use the iloc[ ] method of the Pandas library. It is used to extract the required rows and columns from the dataset.
1. x= data_set.iloc[:,:-1].values
In the above code, the first colon(:) is used to take all the rows, and the second colon(:) is for
all the columns. Here we have used :-1, because we don't want to take the last column as it
contains the dependent variable. So by doing this, we will get the matrix of features.
Output:
1. [['India' 38.0 68000.0]
2. ['France' 43.0 45000.0]
3. ['Germany' 30.0 54000.0]
4. ['France' 48.0 65000.0]
5. ['Germany' 40.0 nan]
6. ['India' 35.0 58000.0]
7. ['Germany' nan 53000.0]
8. ['France' 49.0 79000.0]
9. ['India' 50.0 88000.0]
10. ['France' 37.0 77000.0]]
As we can see in the above output, there are only the three independent variables.
Extracting the dependent variable:
To extract the dependent variable, we will again use the iloc[ ] method of the Pandas library:
1. y= data_set.iloc[:,3].values
Here we have taken all the rows with the last column only. It will give the array of dependent
variables.
Output:
array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
dtype=object)
4) Handling Missing Data
If our dataset contains some missing data, it may create a huge problem for our machine learning model, so it is necessary to handle the missing values present in the dataset. There are mainly two ways to handle missing data:
By deleting the particular row: The first way is commonly used to deal with null values; here, we just delete the specific row or column which consists of null values. But this way is not very efficient, and removing data may lead to loss of information, which will not give an accurate output.
By calculating the mean: In this way, we calculate the mean of the column or row which contains the missing value and put it in place of the missing value. This strategy is useful for features which have numeric data, such as age, salary, year, etc. Here, we will use this approach.
To handle missing values, we will use Scikit-learn library in our code, which contains
various libraries for building machine learning models. Here we will use Imputer class
of sklearn.preprocessing library. Below is the code for it:
1. #handling missing data (Replacing missing data with the mean value)
2. from sklearn.preprocessing import Imputer
3. imputer= Imputer(missing_values ='NaN', strategy='mean', axis = 0)
4. #Fitting imputer object to the independent variables x.
5. imputer= imputer.fit(x[:, 1:3])
6. #Replacing missing data with the calculated mean value
7. x[:, 1:3]= imputer.transform(x[:, 1:3])
Output:
As we can see in the above output, the missing values have been replaced with the means of the remaining column values.
5) Encoding Categorical Data
Categorical data is data which has some categories; in our dataset, there are two categorical variables, Country and Purchased.
Since a machine learning model works entirely on mathematics and numbers, a categorical variable may create trouble while building the model. So it is necessary to encode these categorical variables into numbers.
Firstly, we will convert the country variables into categorical data. So to do this, we will
use LabelEncoder() class from preprocessing library.
1. #Categorical data
2. #for Country Variable
3. from sklearn.preprocessing import LabelEncoder
4. label_encoder_x= LabelEncoder()
5. x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
Output:
Out[15]:
array([[2, 38.0, 68000.0],
[0, 43.0, 45000.0],
[1, 30.0, 54000.0],
[0, 48.0, 65000.0],
[1, 40.0, 65222.22222222222],
[2, 35.0, 58000.0],
[1, 41.111111111111114, 53000.0],
[0, 49.0, 79000.0],
[2, 50.0, 88000.0],
[0, 37.0, 77000.0]], dtype=object)
Explanation:
In the above code, we have imported the LabelEncoder class of the sklearn library. This class has successfully encoded the categories into digits.
But in our case, there are three country categories, and as we can see in the above output, these categories are encoded into 0, 1, and 2. From these values, the machine learning model may assume that there is some correlation between these categories, which would produce wrong output. So to remove this issue, we will use dummy encoding.
Dummy Variables:
Dummy variables are variables which have values 0 or 1. The value 1 indicates the presence of that category in a particular column, and the rest of the variables become 0. With dummy encoding, we will have a number of columns equal to the number of categories.
In our dataset, we have 3 categories, so it will produce three columns having 0 and 1 values.
For dummy encoding, we will use the OneHotEncoder class of the preprocessing library:
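1. from sklearn.preprocessing import OneHotEncoder
2. onehot_encoder= OneHotEncoder(categorical_features= [0])
3. x= onehot_encoder.fit_transform(x).toarray()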
Output:
As we can see in the above output, all the variables are encoded into numbers 0 and 1 and
divided into three columns.
It can be seen more clearly in the variables explorer section, by clicking on x option as:
1. labelencoder_y= LabelEncoder()
2. y= labelencoder_y.fit_transform(y)
For the second categorical variable, we will only use the labelencoder object of the LabelEncoder class. Here we are not using the OneHotEncoder class because the Purchased variable has only two categories, yes or no, which are automatically encoded into 0 and 1.
Output:
6) Splitting the Dataset into the Training set and Test set
In machine learning data preprocessing, we divide our dataset into a training set and test set.
This is one of the crucial steps of data preprocessing as by doing this, we can enhance the
performance of our machine learning model.
Suppose we have trained our machine learning model with one dataset and then we test it on a completely different dataset. This will create difficulties for our model in understanding the correlations in the data.
If we train our model very well and its training accuracy is also very high, but then we provide a new dataset to it, performance will decrease. So we always try to make a machine learning model which performs well with the training set and also with the test dataset. Here,
we can define these datasets as:
Training Set: A subset of dataset to train the machine learning model, and we already know
the output.
Test set: A subset of dataset to test the machine learning model, and by using the test set,
model predicts the output.
For splitting the dataset, we will use the below lines of code:
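1. from sklearn.model_selection import train_test_split
2. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)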
Explanation:
o In the above code, the first line is used for splitting arrays of the dataset into random
train and test subsets.
o In the second line, we have used four variables for our output that are
o x_train: features for the training data
o x_test: features for testing data
o y_train: Dependent variables for training data
o y_test: Dependent variables for testing data
o In the train_test_split() function, we have passed four parameters, of which the first two are the arrays of data, and test_size specifies the size of the test set. The test_size may be .5, .3, or .2, which sets the dividing ratio of training and testing sets.
o The last parameter random_state is used to set a seed for a random generator so that
you always get the same result, and the most used value for this is 42.
Output:
By executing the above code, we will get 4 different variables, which can be seen under the
variable explorer section.
As we can see in the above image, the x and y variables are divided into 4 different variables
with corresponding values.
7) Feature Scaling
Feature scaling is the final step of data preprocessing in machine learning. It is a technique to standardize the independent variables of the dataset to a specific range. In feature scaling, we put our variables in the same range and on the same scale so that no single variable dominates the others.
As we can see, the age and salary column values are not on the same scale. Many machine learning models are based on Euclidean distance, and if we do not scale the variables, this will cause issues in our machine learning model.
If we compute any two values from age and salary, then salary values will dominate the age
values, and it will produce an incorrect result. So to remove this issue, we need to perform
feature scaling for machine learning.
There are two ways to perform feature scaling in machine learning:
Standardization
Normalization
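For reference, these two approaches are commonly defined by the following formulas (standard definitions, not specific to this dataset):
Standardization: x_scaled = (x - mean(x)) / standard_deviation(x)
Normalization: x_scaled = (x - min(x)) / (max(x) - min(x))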
For feature scaling, we will import StandardScaler class of sklearn.preprocessing library as:
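1. from sklearn.preprocessing import StandardScaler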
Now, we will create the object of StandardScaler class for independent variables or features.
And then we will fit and transform the training dataset.
1. st_x= StandardScaler()
2. x_train= st_x.fit_transform(x_train)
1. x_test= st_x.transform(x_test)
Output:
By executing the above lines of code, we will get the scaled values for x_train and x_test as:
x_train:
x_test:
As we can see in the above output, all the variables are scaled to a similar range, roughly between -1 and 1.
Note: Here, we have not scaled the dependent variable because it has only two values, 0 and 1. But if the dependent variable has a wider range of values, then we will also need to scale it.
Now, in the end, we can combine all the steps together to make our complete code more
understandable.
1. # importing libraries
2. import numpy as nm
3. import matplotlib.pyplot as mtp
4. import pandas as pd
5.
6. #importing datasets
7. data_set= pd.read_csv('Dataset.csv')
8.
9. #Extracting Independent Variable
10. x= data_set.iloc[:, :-1].values
11.
12. #Extracting Dependent variable
13. y= data_set.iloc[:, 3].values
14.
15. #handling missing data(Replacing missing data with the mean value)
16. from sklearn.preprocessing import Imputer
17. imputer= Imputer(missing_values ='NaN', strategy='mean', axis = 0)
18.
19. #Fitting imputer object to the independent variables x.
20. imputer= imputer.fit(x[:, 1:3])
21.
22. #Replacing missing data with the calculated mean value
23. x[:, 1:3]= imputer.transform(x[:, 1:3])
24.
25. #for Country Variable
26. from sklearn.preprocessing import LabelEncoder, OneHotEncoder
27. label_encoder_x= LabelEncoder()
28. x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
29.
30. #Encoding for dummy variables
31. onehot_encoder= OneHotEncoder(categorical_features= [0])
32. x= onehot_encoder.fit_transform(x).toarray()
33.
34. #encoding for purchased variable
35. labelencoder_y= LabelEncoder()
36. y= labelencoder_y.fit_transform(y)
37.
38. # Splitting the dataset into training and test set.
39. from sklearn.model_selection import train_test_split
40. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
41.
42. #Feature Scaling of datasets
43. from sklearn.preprocessing import StandardScaler
44. st_x= StandardScaler()
45. x_train= st_x.fit_transform(x_train)
46. x_test= st_x.transform(x_test)
In the above code, we have included all the data preprocessing steps together. But there are
some steps or lines of code which are not necessary for all machine learning models. So we
can exclude them from our code to make it reusable for all models.
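Note that the Imputer class and the categorical_features argument used above come from older scikit-learn releases and have since been removed. Below is a minimal sketch (not part of the original tutorial code) of the same preprocessing steps written with the current scikit-learn API, assuming the same Dataset.csv layout; SimpleImputer and ColumnTransformer take the place of Imputer and the categorical_features option:

# A rough modern-API equivalent of the preprocessing code above (sketch only)
import numpy as nm
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

data_set = pd.read_csv('Dataset.csv')
x = data_set.iloc[:, :-1].values
y = data_set.iloc[:, 3].values

# Replace missing Age/Salary values with the column mean (replaces the old Imputer)
imputer = SimpleImputer(missing_values=nm.nan, strategy='mean')
x[:, 1:3] = imputer.fit_transform(x[:, 1:3])

# One-hot encode the Country column (index 0) and pass the other columns through
ct = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough')
x = ct.fit_transform(x)

# Encode the Purchased variable, then split and scale exactly as before
y = LabelEncoder().fit_transform(y)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)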
Supervised Machine Learning
In supervised learning, the training data provided to the machines works as the supervisor that teaches the machines to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.
Supervised learning is a process of providing input data as well as correct output data to the
machine learning model. The aim of a supervised learning algorithm is to find a mapping
function to map the input variable(x) with the output variable(y).
In the real-world, supervised learning can be used for Risk Assessment, Image
classification, Fraud Detection, spam filtering, etc.
In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on the basis of test data (a held-out subset of the data), and then it predicts the output.
The working of Supervised learning can be easily understood by the below example and
diagram:
Suppose we have a dataset of different types of shapes which includes square, rectangle,
triangle, and Polygon. Now the first step is that we need to train the model for each shape.
o If the given shape has four sides, and all the sides are equal, then it will be labelled as
a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides, then it will be labelled as a hexagon.
Now, after training, we test our model using the test set, and the task of the model is to
identify the shape.
The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of its number of sides and predicts the output.
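As a rough sketch of this idea (the numeric encoding of the shapes below is an assumption made only for illustration), a supervised classifier can be trained on labelled shape features and then asked to predict the label of a new shape:

from sklearn.tree import DecisionTreeClassifier

# Assumed features: [number_of_sides, all_sides_equal (1 = yes, 0 = no)]
X_train = [[4, 1], [4, 0], [3, 1], [3, 0], [6, 1]]
y_train = ["Square", "Rectangle", "Triangle", "Triangle", "Hexagon"]

# Train the model on the labelled shapes
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Predict the label of a new shape with four equal sides
print(model.predict([[4, 1]]))   # expected output: ['Square']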
Types of supervised Machine learning Algorithms:
1. Regression
Regression algorithms are used if there is a relationship between the input variable and the
output variable. It is used for the prediction of continuous variables, such as Weather
forecasting, Market Trends, etc. Below are some popular Regression algorithms which come
under supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, which means there are two or more classes such as Yes-No, Male-Female, True-False, etc. Spam filtering is a common example of classification. Below are some popular classification algorithms which come under supervised learning:
o Random Forest
o Decision Trees
o Logistic Regression
o Support Vector Machines
Advantages of Supervised Learning:
o With the help of supervised learning, the model can predict the output on the basis of prior experience.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning models help us to solve various real-world problems such as fraud detection, spam filtering, etc.
Disadvantages of Supervised Learning:
o Supervised learning models are not suitable for handling complex tasks.
o Supervised learning cannot predict the correct output if the test data is different from the training dataset.
o Training requires a lot of computation time.
o In supervised learning, we need enough knowledge about the classes of objects.
Unsupervised Machine Learning
In unsupervised learning, the machine is trained on unlabelled data; for example, an unsupervised learning algorithm can group an image dataset into clusters according to the similarities between the images.
Below are some main reasons which describe the importance of Unsupervised Learning:
o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is similar to how a human learns to think through their own experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data which make
unsupervised learning more important.
o In the real world, we do not always have input data with corresponding output, so to solve such cases, we need unsupervised learning.
Here, we have taken unlabelled input data, which means it is not categorized and the corresponding outputs are also not given. This unlabelled input data is fed to the machine learning model in order to train it. First, the model will interpret the raw data to find hidden patterns in the data and will then apply suitable algorithms such as k-means clustering, decision trees, etc.
Once it applies the suitable algorithm, the algorithm divides the data objects into groups
according to the similarities and difference between the objects.
The unsupervised learning algorithm can be further categorized into two types of problems:
o Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in a group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between the data objects and categorizes them as per the presence and absence of those commonalities.
o Association: An association rule is an unsupervised learning method which is used for finding relationships between variables in a large database. It determines the set of items that occur together in the dataset. Association rules make marketing strategies more effective: for example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of an association rule is Market Basket Analysis.
Below is a list of some popular unsupervised learning algorithms:
o K-means clustering
o KNN (k-nearest neighbors)
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition
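A minimal sketch of one of these algorithms, k-means clustering, on assumed two-dimensional points (the data is invented purely for illustration):

from sklearn.cluster import KMeans

# Assumed unlabelled points forming two rough groups
X = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
     [8.0, 8.0], [8.5, 7.5], [7.8, 8.3]]

# Group the points into 2 clusters purely by similarity (no labels needed)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                    # cluster index assigned to each point
print(kmeans.cluster_centers_)   # centre of each discovered cluster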
SUPERVISED LEARNING
Regression Analysis in Machine learning:
Regression analysis is a statistical method to model the relationship between a dependent (target) variable and one or more independent (predictor) variables. More specifically, regression analysis helps us to understand how the value of the dependent variable changes corresponding to an independent variable when the other independent variables are held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.
We can understand the concept of regression analysis using the below example:
Example: Suppose there is a marketing company A, which runs various advertisements every year and gets sales from them. The list below shows the advertisements made by the company in the last 5 years and the corresponding sales:
Now, the company wants to run a $200 advertisement in the year 2019 and wants to know the prediction for sales in that year. To solve such prediction problems in machine learning, we need regression analysis.
Regression is a supervised learning technique which helps in finding the correlation between variables
and enables us to predict the continuous output variable based on the one or more predictor variables.
It is mainly used for prediction, forecasting, time series modeling, and determining the causal-effect
relationship between variables.
In Regression, we plot a graph between the variables which best fits the given datapoints, using this
plot, the machine learning model can make predictions about the data. In simple words, "Regression
shows a line or curve that passes through all the datapoints on target-predictor graph in such a way
that the vertical distance between the datapoints and the regression line is minimum." The distance
between datapoints and line tells whether a model has captured a strong relationship or not.
Terminologies Related to Regression Analysis:
o Dependent Variable: The main factor in regression analysis that we want to predict or understand is called the dependent variable. It is also called the target variable.
o Independent Variable: The factors which affect the dependent variable, or which are used to predict its values, are called independent variables, also called predictors.
o Outliers: Outlier is an observation which contains either very low value or very high
value in comparison to other observed values. An outlier may hamper the result, so it
should be avoided.
o Multicollinearity: If the independent variables are highly correlated with each other, then this condition is called multicollinearity. It should not be present in the dataset, because it creates problems when ranking the most important variables.
o Underfitting and Overfitting: If our algorithm works well with the training dataset
but not well with test dataset, then such problem is called Overfitting. And if our
algorithm does not perform well even with training dataset, then such problem is
called underfitting.
Why do we use Regression Analysis?
o Regression estimates the relationship between the target and the independent variables.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing the regression, we can confidently determine the most important
factor, the least important factor, and how each factor is affecting the other
factors.
Types of Regression
There are various types of regressions which are used in data science and machine learning.
Each type has its own importance on different scenarios, but at the core, all the regression
methods analyse the effect of the independent variable on dependent variables. Here we are
discussing some important types of regression which are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression:
Linear Regression:
o Linear regression is a statistical regression method which is used for predictive
analysis.
o It is one of the very simple and easy algorithms which works on regression and shows
the relationship between the continuous variables.
o It is used for solving the regression problem in machine learning.
o Linear regression shows the linear relationship between the independent variable (X-
axis) and the dependent variable (Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear regression is called simple
linear regression. And if there is more than one input variable, then such linear
regression is called multiple linear regression.
o The relationship between variables in the linear regression model can be explained
using the below image. Here we are predicting the salary of an employee on the basis
of the year of experience.
Below is the mathematical equation for Linear Regression:
Y = aX + b
Here Y is the dependent variable, X is the independent variable, a is the slope of the line, and b is the intercept.
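For illustration, a minimal sketch of fitting Y = aX + b with scikit-learn could look like the following. The numbers here are made-up example values, not data from this tutorial:
import numpy as np
from sklearn.linear_model import LinearRegression

# made-up values: years of experience (X) and salary in thousands (Y)
X = np.array([[1], [2], [3], [4], [5]])
Y = np.array([30, 35, 42, 48, 55])

model = LinearRegression().fit(X, Y)
print("slope a:", model.coef_[0], "intercept b:", model.intercept_)
print("predicted salary for 6 years:", model.predict([[6]])[0])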
Logistic Regression:
o Logistic regression is another supervised learning algorithm which is used to solve
classification problems. In classification problems, the dependent variable is in a
binary or discrete format such as 0 or 1.
o The logistic regression algorithm works with categorical variables such as 0 or 1, Yes
or No, True or False, Spam or Not Spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it differs from the linear regression
algorithm in terms of how it is used.
o Logistic regression uses the sigmoid function, or logistic function, to model the data.
The function can be represented as:
f(x) = 1 / (1 + e^(-x))
When we provide the input values (data) to the function, it gives an S-curve.
Logistic regression uses the concept of threshold levels: values above the threshold level are rounded up to 1,
and values below the threshold level are rounded down to 0.
There are three types of logistic regression:
Binary(0/1, pass/fail)
Multi(cats, dogs, lions)
Ordinal(low, medium, high)
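As a quick illustration of the sigmoid and the threshold idea, here is a minimal sketch using scikit-learn with made-up example data (the feature and labels below are assumptions for illustration only):
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # logistic function: maps any real value into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

# made-up 1-D feature with a binary label (e.g. hours studied -> pass/fail)
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba([[3.5]])[0, 1]   # probability of class 1
label = int(proba >= 0.5)                  # apply the threshold level of 0.5
print("probability:", proba, "predicted class:", label)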
Polynomial Regression:
o Polynomial Regression is a type of regression which models the non-linear
dataset using a linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve between the
value of x and corresponding conditional values of y.
o Suppose there is a dataset whose datapoints are arranged in a non-linear fashion; in such a
case, linear regression will not fit those datapoints well. To cover such datapoints, we need
polynomial regression.
o In polynomial regression, the original features are transformed into polynomial
features of a given degree and then modelled using a linear model, which means the
datapoints are best fitted using a polynomial curve.
o The equation for polynomial regression is also derived from the linear regression equation:
the linear regression equation Y = b0 + b1x is transformed into the polynomial
regression equation Y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ.
o Here Y is the predicted/target output, b0, b1, ..., bn are the regression coefficients, and x
is our independent/input variable.
o The model is still considered linear because the coefficients b0, b1, ..., bn are linear; only
the features are raised to higher powers.
Support Vector Regression:
Support Vector Machine is a supervised learning algorithm which can be used for regression
as well as classification problems. If we use it for regression problems, it is termed
Support Vector Regression.
Support Vector Regression is a regression algorithm which works for continuous variables.
Below are some keywords which are used in Support Vector Regression:
o Support vectors: Support vectors are the datapoints which are nearest to the
hyperplane and which define the position of the margin on either side of it.
In SVR, we always try to determine a hyperplane with a maximum margin, so that the maximum
number of datapoints is covered within that margin. The main goal of SVR is to include as many
datapoints as possible between the boundary lines, with the hyperplane (best-fit line) passing
through a maximum number of datapoints. Consider the below image:
Here, the blue line is called the hyperplane, and the other two lines are known as the boundary lines.
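For illustration, a minimal SVR sketch with scikit-learn, using made-up example data (the kernel, C, and epsilon values below are arbitrary choices, not values from this tutorial):
import numpy as np
from sklearn.svm import SVR

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([1.2, 1.9, 3.2, 3.9, 5.1, 6.2])

# epsilon defines the width of the margin (tube) around the hyperplane
svr = SVR(kernel="rbf", C=100, epsilon=0.1)
svr.fit(X, y)
print(svr.predict([[3.5]]))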
Decision Tree Regression:
The above image shows an example of decision tree regression; here, the model is trying to
predict a person's choice between a sports car and a luxury car.
Random Forest Regression:
o Random forest is one of the most powerful supervised learning algorithms, capable of
performing regression as well as classification tasks.
o Random forest regression is an ensemble learning method which combines multiple decision
trees and predicts the final output as the average of the individual tree outputs. The
combined decision trees are called base models, and the prediction can be represented more formally as:
g(x) = (f1(x) + f2(x) + ... + fn(x)) / n, where f1, f2, ..., fn are the individual decision trees (base models).
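A minimal sketch of random forest regression in scikit-learn, with made-up example data (the number of trees and the data values are illustrative assumptions):
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([1.1, 1.9, 3.0, 4.2, 5.1, 5.9, 7.2, 8.1])

# n_estimators is the number of decision trees (base models) whose outputs are averaged
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.predict([[4.5]]))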
Ridge Regression:
o Ridge regression is one of the most robust versions of linear regression, in which a small
amount of bias is introduced so that we can get better long-term predictions.
o The amount of bias added to the model is known as the ridge regression penalty. We can
compute this penalty term by multiplying lambda by the squared weight of each individual feature.
o The equation for ridge regression (its cost function) will be:
Cost = Σ(yi − ŷi)² + λ Σ bj²
o A general linear or polynomial regression will fail if there is high collinearity between the
independent variables; to solve such problems, ridge regression can be used.
o Ridge regression is a regularization technique, which is used to reduce the complexity of the
model. It is also called L2 regularization.
o It helps to solve problems where we have more parameters than samples.
Lasso Regression:
o Lasso regression is another regularization technique used to reduce the complexity of the model.
o It is similar to ridge regression, except that the penalty term contains only the absolute
values of the weights instead of their squares.
o Since it takes absolute values, it can shrink a slope all the way to 0, whereas ridge regression
can only shrink it close to 0.
o It is also called L1 regularization. The equation for lasso regression (its cost function) will be:
Cost = Σ(yi − ŷi)² + λ Σ |bj|
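As a quick illustration of both penalties, here is a minimal scikit-learn sketch with made-up data (the alpha values, which play the role of lambda, are arbitrary choices):
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# two highly correlated features, made-up values
X = np.array([[1, 2.0], [2, 4.1], [3, 5.9], [4, 8.2], [5, 9.9]])
y = np.array([3, 5, 7, 9, 11])

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients towards 0
lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty: can shrink coefficients exactly to 0
print("ridge coefficients:", ridge.coef_)
print("lasso coefficients:", lasso.coef_)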
Linear Regression in Machine Learning:
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent (y) variable and one or
more independent (x) variables, hence the name linear regression. Since linear regression
shows a linear relationship, it finds how the value of the dependent variable changes
according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship
between the variables. Consider the below image:
Here,
The values for x and y variables are training datasets for Linear Regression model
representation.
Types of Linear Regression
Linear regression can be further divided into two types of the algorithm:
Simple Linear Regression: If a single independent variable is used to predict the value of a
numerical dependent variable, then such a Linear Regression algorithm is called Simple
Linear Regression.
Multiple Linear regression: If more than one independent variable is used to predict the
value of a numerical dependent variable, then such a Linear Regression algorithm is called
Multiple Linear Regression.
A linear line showing the relationship between the dependent and independent variables is
called a regression line. A regression line can show two types of relationship:
Positive Linear Relationship: If the dependent variable increases on the Y-axis and
independent variable increases on X-axis, then such a relationship is termed as a Positive
linear relationship.
Negative Linear Relationship: If the dependent variable decreases on the Y-axis and
independent variable increases on the X-axis, then such a relationship is called a negative
linear relationship.
Finding the best fit line:
When working with linear regression, our main goal is to find the best fit line, which means the
error between the predicted values and the actual values should be minimized. The best fit line will
have the least error.
Different values of the weights or line coefficients (a0, a1) give different regression lines,
so we need to calculate the best values for a0 and a1 to find the best fit line. To calculate
these, we use a cost function.
Cost function-
o Different values of the weights or line coefficients (a0, a1) give different regression lines,
and the cost function is used to estimate the values of the coefficients for
the best fit line.
o Cost function optimizes the regression coefficients or weights. It measures how a
linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which
maps the input variable to the output variable. This mapping function is also known
as Hypothesis function.
For linear regression, we use the Mean Squared Error (MSE) cost function, which is the
average of the squared errors between the predicted values and the actual values. It can be
written as:
MSE = (1/N) Σ (yi − (a1xi + a0))²
Where N is the total number of observations, yi is the actual value, and (a1xi + a0) is the predicted value.
Residuals: The distance between an actual value and the corresponding predicted value is called a residual. If
the observed points are far from the regression line, the residuals will be high, and so the cost
function will be high. If the scatter points are close to the regression line, the residuals will
be small and hence the cost function will be small.
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost
function.
o A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.
o It is done by randomly selecting initial coefficient values and then iteratively updating them
to reach the minimum of the cost function.
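To make the idea concrete, here is a small illustrative sketch of gradient descent updating a0 and a1 to minimise the MSE. The data values, learning rate, and iteration count are made-up choices for illustration only:
import numpy as np

# made-up training data: x = years of experience, y = salary in thousands
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([30.0, 35.0, 42.0, 48.0, 55.0])

a0, a1 = 0.0, 0.0            # start from initial coefficient values
lr = 0.01                    # learning rate
for _ in range(5000):
    y_pred = a0 + a1 * x
    error = y_pred - y
    mse = np.mean(error ** 2)            # cost function (MSE)
    a0 -= lr * 2 * np.mean(error)        # gradient of MSE with respect to a0
    a1 -= lr * 2 * np.mean(error * x)    # gradient of MSE with respect to a1

print("a0 (intercept):", a0, "a1 (slope):", a1, "final MSE:", mse)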
Model Performance:
The Goodness of fit determines how the line of regression fits the set of observations. The
process of finding the best model out of various models is called optimization. It can be
achieved by below method:
1. R-squared method: R-squared is a statistical measure of goodness of fit. It represents the
proportion of the variance in the dependent variable that is explained by the model, on a scale of
0 to 1 (0-100%); the higher the value, the better the regression line fits the observations.
Below are some important assumptions of Linear Regression. These are formal checks to perform
while building a Linear Regression model, which ensure that we get the best possible result from
the given dataset.
o Linear relationship between the features and target: Linear regression assumes the
linear relationship between the dependent and independent variables.
o Small or no multicollinearity between the features: Multicollinearity means a high
correlation between the independent variables. Due to multicollinearity, it may be
difficult to find the true relationship between the predictors and target variables. Or
we can say, it is difficult to determine which predictor variable is affecting the target
variable and which is not. So, the model assumes either little or no multicollinearity
between the features or independent variables.
o Homoscedasticity Assumption: Homoscedasticity is a situation in which the error term
has the same variance for all values of the independent variables. With homoscedasticity, there
should be no clear pattern in the distribution of the data in the scatter plot.
o Normal distribution of error terms: Linear regression assumes that the error terms
follow a normal distribution. If the error terms are not normally
distributed, the confidence intervals will become either too wide or too narrow,
which may cause difficulties in finding the coefficients. This can be checked using a Q-Q
plot; if the plot shows a straight line without large deviations, the errors are
normally distributed.
o No autocorrelations: The linear regression model assumes no autocorrelation in error
terms. If there is any correlation in the error term, then it will drastically reduce the
accuracy of the model. Autocorrelation usually occurs if there is a dependency
between residual errors.
o Model the relationship between two variables, such as the relationship between
income and expenditure, or experience and salary.
o Forecast new observations, such as weather forecasting according to
temperature, or a company's revenue according to its investments in a year.
Mathematically, we can represent a linear regression as:
y = a0 + a1x + ε
Where,
a0 = the intercept of the regression line (can be obtained by putting x = 0)
a1 = the slope of the regression line, which tells whether the line is increasing or decreasing
ε = the error term (for a good model it will be negligible)
Implementation of Simple Linear Regression Algorithm using Python:
Here we are taking a dataset that has two variables: salary (dependent variable) and
experience (independent variable). The goals of this problem are:
o To find out if there is any correlation between these two variables.
o To find the best fit line for the dataset.
o To see how the dependent variable changes as the independent variable changes.
In this section, we will create a Simple Linear Regression model to find out the best fitting
line for representing the relationship between these two variables.
To implement the Simple Linear regression model in machine learning using Python, we
need to follow the below steps:
Step 1: Data Pre-processing:
The first step in creating the Simple Linear Regression model is data pre-processing. We
have already done it earlier in this tutorial, but there will be some changes, which are given
in the steps below:
o First, we will import the three important libraries, which will help us for loading the
dataset, plotting the graphs, and creating the Simple Linear Regression model.
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
Next, we will load the dataset into our code:
data_set= pd.read_csv('Salary_Data.csv')
By executing the above line of code (ctrl+ENTER), we can read the dataset on our
Spyder IDE screen by clicking on the variable explorer option.
The above output shows the dataset, which has two variables: Salary and Experience.
After that, we need to extract the dependent and independent variables from the given
dataset. The independent variable is years of experience, and the dependent variable is
salary. Below is code for it:
x= data_set.iloc[:, :-1].values
y= data_set.iloc[:, 1].values
In the above lines of code, for the x variable, we have used -1 because we want to
remove the last column from the dataset. For the y variable, we have used 1 as the
parameter, since we want to extract the second column and indexing starts from
zero.
By executing the above line of code, we will get the output for X and Y variable as:
In the above output image, we can see the X (independent) variable and Y (dependent)
variable has been extracted from the given dataset.
Next, we will split both variables into the test set and training set. We have 30
observations, so we will take 20 observations for the training set and 10 observations
for the test set. We are splitting our dataset so that we can train our model using a
training dataset and then test the model using a test dataset. The code for this is given
below:
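The splitting code itself is a short call to scikit-learn's train_test_split; a minimal version consistent with the 20/10 split described above (the random_state value is an arbitrary choice) is:
# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/3, random_state=0)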
By executing the above code, we will get x-test, x-train and y-test, y-train dataset.
Consider the below images:
Test-dataset:
Training Dataset:
For Simple Linear Regression, we will not use feature scaling, because the Python
libraries take care of it in some cases, so we don't need to perform it here. Now our
dataset is well prepared to work on, and we are going to start building a Simple
Linear Regression model for the given problem.
Step-2: Fitting the Simple Linear Regression to the Training Set:
Now the second step is to fit our model to the training dataset. To do so, we will import the
LinearRegression class of the linear_model library from scikit-learn. After importing the
class, we will create an object of the class named regressor. The code for this is
given below:
#Fitting the Simple Linear Regression model to the training dataset
from sklearn.linear_model import LinearRegression
regressor= LinearRegression()
regressor.fit(x_train, y_train)
In the above code, we have used the fit() method to fit our Simple Linear Regression object to
the training set. In the fit() function, we have passed x_train and y_train, which are our
training data for the independent and dependent variables. We have fitted our regressor
object to the training set so that the model can easily learn the correlations between the
predictor and target variables. After executing the above lines of code, we will get the below
output.
Output:
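Step 3: Prediction of the test set result:
Before the comparison described below, the test-set predictions have to be generated. A minimal version of this step, consistent with the y_pred and x_pred variables used in the following paragraphs, is:
#Prediction of Test and Training set result
y_pred = regressor.predict(x_test)
x_pred = regressor.predict(x_train)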
You can check the variable by clicking on the variable explorer option in the IDE, and also
compare the result by comparing values from y_pred and y_test. By comparing these values,
we can check how good our model is performing.
Step 4: Visualizing the Training set results:
Now in this step, we will visualize the training set result. To do so, we will use the scatter()
function of the pyplot library, which we have already imported in the pre-processing step.
The scatter() function will create a scatter plot of the observations.
On the x-axis we will plot the years of experience of the employees, and on the y-axis the salary of
the employees. In the function, we will pass the real values of the training set, which means the years of
experience x_train, the training set of salaries y_train, and the colour of the observations. Here we are
taking green for the observations, but it can be any colour of our choice.
Now, we need to plot the regression line, so for this, we will use the plot() function of the
pyplot library. In this function, we will pass the years of experience for training set, predicted
salary for training set x_pred, and color of the line.
Next, we will give the title for the plot. So here, we will use the title() function of the pyplot
library and pass the name ("Salary vs Experience (Training Dataset)".
After that, we will assign labels for x-axis and y-axis using xlabel() and ylabel() function.
Finally, we will represent all above things in a graph using show(). The code is given below:
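A minimal version of this plotting code, consistent with the description above (the exact labels are a reasonable guess), is:
#Visualizing the Training set results
mtp.scatter(x_train, y_train, color="green")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Training Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary")
mtp.show()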
Output:
By executing the above lines of code, we will get the below graph plot as an output.
In the above plot, there are observations given by the blue color, and prediction is given by
the red regression line. As we can see, most of the observations are close to the regression
line, hence we can say our Simple Linear Regression is a good model and able to make good
predictions.
Multiple Linear Regression (MLR):
Example: Prediction of CO2 emission based on the engine size and the number of cylinders of a car.
Some key points about MLR:
For MLR, the dependent or target variable (Y) must be continuous/real, but the
predictor or independent variables may be continuous or categorical.
Each feature variable must model a linear relationship with the dependent variable.
MLR tries to fit a regression line (a hyperplane) through a multidimensional space of datapoints.
MLR equation:
In Multiple Linear Regression, the target variable (Y) is a linear combination of multiple
predictor variables x1, x2, x3, ..., xn. Since it is an extension of Simple Linear Regression,
the same form applies, and the equation becomes:
Y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn ............... (a)
Where,
Y = the output/response variable
b0, b1, b2, ..., bn = the coefficients of the model
x1, x2, x3, ..., xn = the independent/feature variables
Assumptions for Multiple Linear Regression:
o A linear relationship should exist between the target and the predictor variables.
o The regression residuals must be normally distributed.
o MLR assumes little or no multicollinearity (correlation between the independent
variable) in data.
Problem Description:
We have a dataset of 50 companies with five variables: R&D Spend, Administration Spend,
Marketing Spend, State, and Profit. Since we need to find the Profit, it is the dependent variable,
and the other four variables are independent variables. Below are the main steps of deploying the MLR model:
The very first step is data pre-processing, which we have already discussed in this tutorial.
This process contains the below steps:
Importing libraries: Firstly we will import the library which will help in building the
model. Below is the code for it:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
o Importing dataset: Now we will import the dataset(50_CompList), which contains
all the variables. Below is the code for it:
#importing datasets
data_set= pd.read_csv('50_CompList.csv')
Output: We will get the dataset as:
In the above output, we can clearly see that there are five variables, of which four are
continuous and one is a categorical variable.
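The code that extracts the dependent and independent variables is the next step; following the same pattern as the Simple Linear Regression example (Profit, the last column, is assumed to be the target), it would look like:
#Extracting Independent and dependent Variable
x= data_set.iloc[:, :-1].values
y= data_set.iloc[:, 4].values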
Output:
Out[5]: a truncated view of the matrix of features x, an object array whose first three columns
are the numeric R&D, Administration, and Marketing spends and whose last column is the
categorical State ('New York', 'California', 'Florida').
As we can see in the above output, the last column contains a categorical variable, which is
not suitable to be applied directly for fitting the model. So we need to encode this variable.
As we have one categorical variable (State), which cannot be directly applied to the model, we
will encode it. To encode the categorical variable into numbers, we will use
the LabelEncoder class. But that is not sufficient, because the encoded values still have a relational order,
which may produce a wrong model. So, to remove this problem, we will
use OneHotEncoder, which will create the dummy variables. Below is the code for it:
#Categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_x= LabelEncoder()
x[:, 3]= labelencoder_x.fit_transform(x[:,3])
# Note: the categorical_features argument belongs to older scikit-learn versions;
# in recent versions, ColumnTransformer with OneHotEncoder is used instead.
onehotencoder= OneHotEncoder(categorical_features= [3])
x= onehotencoder.fit_transform(x).toarray()
Here we are only encoding one independent variable, State, as the other variables are continuous.
Output:
As we can see in the above output, the State column has been converted into dummy variables (0 and
1). Each dummy variable column corresponds to one State, which we can check by
comparing it with the original dataset: the first column corresponds to California, the
second column to Florida, and the third column to New York.
o Now, we are writing a single line of code just to avoid the dummy variable trap:
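That single line (it also appears in the full code listing later in this chapter) is:
#Avoiding the dummy variable trap:
x = x[:, 1:]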
As we can see in the above output image, the first column has been removed.
o Now we will split the dataset into training and test set. The code for this is given
below:
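A minimal version of the splitting code, consistent with the earlier steps (the 80/20 split and random_state are assumed values), is:
# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)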
The above code will split our dataset into a training set and test set.
Output: The above code will split the dataset into training set and test set. You can check the
output by clicking on the variable explorer option given in Spyder IDE. The test set and
training set will look like the below image:
Test set:
Training set:
Now, we have well prepared our dataset in order to provide training, which means we will fit
our regression model to the training set. It will be similar to as we did in Simple Linear
Regression model. The code for this will be:
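Consistent with the Simple Linear Regression step, the fitting code would be:
#Fitting the MLR model to the training set
from sklearn.linear_model import LinearRegression
regressor= LinearRegression()
regressor.fit(x_train, y_train)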
Output:
Now, we have successfully trained our model using the training dataset. In the next step, we
will test the performance of the model using the test dataset.
The last step for our model is checking the performance of the model. We will do it by
predicting the test set result. For prediction, we will create a y_pred vector. Below is the
code for it:
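A minimal version of this prediction step, consistent with the y_pred vector described above, is:
#Predicting the Test set result
y_pred= regressor.predict(x_test)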
By executing the above lines of code, a new vector will be generated under the variable
explorer option. We can test our model by comparing the predicted values and test set values.
Output:
In the above output, we have the predicted result set and the test set. We can check the model performance by
comparing these two values index by index. For example, the first index has a predicted value
of $103,015 profit and a test/real value of $103,282 profit. The difference is only $267, which is a
good prediction, so, finally, our model is completed here.
o We can also check the score for training dataset and test dataset. Below is the code for
it:
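A minimal sketch of this score check, using the score() method of the fitted regressor, would be:
print('Train Score: ', regressor.score(x_train, y_train))
print('Test Score: ', regressor.score(x_test, y_test))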
What is Backward Elimination?
Backward elimination is a feature selection technique while building a machine learning
model. It is used to remove those features that do not have a significant effect on the
dependent variable or prediction of output. There are various ways to build a model in
Machine Learning, which are:
1. All-in
2. Backward Elimination
3. Forward Selection
4. Bidirectional Elimination
5. Score Comparison
Above are the possible methods for building a model in machine learning, but here we will only
use the Backward Elimination process, as it is the fastest method.
Below are some main steps which are used to apply backward elimination process:
Step-1: Firstly, We need to select a significance level to stay in the model. (SL=0.05)
Step-2: Fit the complete model with all possible predictors/independent variables.
Step-3: Choose the predictor which has the highest P-value. If its P-value is greater than the
significance level, go to Step 4; otherwise, the model is ready.
Step-4: Remove that predictor.
Step-5: Rebuild and fit the model with the remaining variables, then repeat from Step 3.
In the previous chapter, we discussed and successfully created our Multiple Linear
Regression model, where we took 4 independent variables (R&D spend, Administration
spends, Marketing spends, and state (dummy variables)) and one dependent variable
(Profit). But that model is not optimal, as we have included all the independent variables and
do not know which independent variable affects the prediction the most and which affects it
the least.
Unnecessary features increase the complexity of the model. Hence it is good to have only the
most significant features and keep our model simple to get the better result.
So, to optimize the performance of the model, we will use the Backward Elimination method. This
process is used to optimize the performance of the MLR model as it will only include the most
affecting feature and remove the least affecting feature. Let us start to apply it to our MLR model.
We will use the same model which we build in the previous chapter of MLR. Below is the
complete code for it:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('50_CompList.csv')
#Extracting Independent and dependent Variable
# (these extraction lines do not appear in the text above; Profit, the last column, is assumed to be the target)
x= data_set.iloc[:, :-1].values
y= data_set.iloc[:, 4].values
#Categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_x= LabelEncoder()
x[:, 3]= labelencoder_x.fit_transform(x[:,3])
onehotencoder= OneHotEncoder(categorical_features= [3])
x= onehotencoder.fit_transform(x).toarray()
#Avoiding the dummy variable trap:
x = x[:, 1:]
#Splitting the dataset into training and test set (referred to below; the split values are assumed)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
From the above code, we got the training and test set results. Before applying backward
elimination, we need to add a constant column to the matrix of features, because the
statsmodels OLS model does not add the intercept term b0 on its own:
x = nm.append(arr = nm.ones((50,1)).astype(int), values=x, axis=1)
Here we have used axis=1, as we wanted to add a column. For adding a row, we can use axis=0.
Output: By executing the above line of code, a new column will be added into our matrix of
features, which will have all values equal to 1. We can check it by clicking on the x dataset
under the variable explorer option.
As we can see in the above output image, the first column is added successfully, which
corresponds to the constant term of the MLR equation.
Step: 2:
o Now, we are actually going to apply a backward elimination process. Firstly we will
create a new feature vector x_opt, which will only contain a set of independent
features that are significantly affecting the dependent variable.
o Next, as per the Backward Elimination process, we need to choose a significance
level (0.05), and then fit the model with all possible predictors. For fitting the
model, we will create a regressor_OLS object of the OLS class of the statsmodels
library, and then fit it using the fit() method.
o Next we need p-value to compare with SL value, so for this we will
use summary() method to get the summary table of all the values. Below is the code
for it:
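The code listing is missing here; consistent with the iterations shown below, it would look like this (x has six columns after adding the constant, so all indices 0-5 are included at first):
import statsmodels.api as sm
x_opt= x[:, [0,1,2,3,4,5]]
regressor_OLS= sm.OLS(endog= y, exog= x_opt).fit()
regressor_OLS.summary()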
In the above image, we can clearly see the p-values of all the variables. Here x1, x2 are
dummy variables, x3 is R&D spend, x4 is Administration spend, and x5 is Marketing
spend.
From the table, we will choose the highest p-value, which is 0.953 for x1. Since this highest
p-value is greater than the SL value, we will remove the x1 variable (a dummy
variable) from the table and refit the model. Below is the code for it:
x_opt=x[:, [0,2,3,4,5]]
regressor_OLS=sm.OLS(endog = y, exog=x_opt).fit()
regressor_OLS.summary()
Output:
As we can see in the output image, five variables now remain. Among these, the highest
p-value is 0.961, for the x1 variable, which is another dummy variable. So we will remove it
and refit the model. Below is the code for it:
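The code for this iteration is not shown in the text; following the pattern of the previous and following steps, it would be:
x_opt=x[:, [0,3,4,5]]
regressor_OLS=sm.OLS(endog = y, exog=x_opt).fit()
regressor_OLS.summary()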
Output:
In the above output image, we can see that the dummy variable (x2) has been removed. The
next highest p-value is 0.602, which is still greater than 0.05, so we need to remove it.
o Now we will remove Administration spend, which has a 0.602 p-value, and again refit
the model.
x_opt=x[:, [0,3,5]]
regressor_OLS=sm.OLS(endog = y, exog=x_opt).fit()
regressor_OLS.summary()
Output:
As we can see in the above output image, the Administration spend variable has been removed. But
there is still one more variable left, Marketing spend, and its p-value (0.60) is above the
significance level, so we need to remove it as well.
o Finally, we will remove the Marketing spend variable, whose 0.60 p-value is greater
than the significance level.
Below is the code for it:
x_opt=x[:, [0,3]]
regressor_OLS=sm.OLS(endog = y, exog=x_opt).fit()
regressor_OLS.summary()
Output:
As we can see in the above output image, only two columns are left: the constant and R&D Spend.
So only the R&D Spend independent variable is a significant variable for the prediction, and we
can now predict efficiently using this variable alone.
In the previous topic, we have calculated the train and test score of the model when we have
used all the features variables. Now we will check the score with only one feature variable
(R&D spend). Our dataset now looks like:
Below is the code for Building Multiple Linear Regression model by only using R&D
spend:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('50_CompList1.csv')
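# NOTE: the extraction and splitting lines do not appear in this listing; a sketch consistent
# with the variable names used below (x_BE_train, y_BE_train) and with the earlier steps,
# assuming 50_CompList1.csv has only the R&D Spend and Profit columns, would be:
#Extracting Independent and dependent Variable (only R&D Spend and Profit)
x_BE= data_set.iloc[:, :-1].values
y_BE= data_set.iloc[:, 1].values
# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_BE_train, x_BE_test, y_BE_train, y_BE_test= train_test_split(x_BE, y_BE, test_size= 0.2, random_state=0)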
from sklearn.linear_model import LinearRegression
regressor= LinearRegression()
regressor.fit(nm.array(x_BE_train).reshape(-1,1), y_BE_train)
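The score-printing code is not shown in the text; consistent with the earlier score check, it would look something like:
print('Train Score: ', regressor.score(nm.array(x_BE_train).reshape(-1,1), y_BE_train))
print('Test Score: ', regressor.score(nm.array(x_BE_test).reshape(-1,1), y_BE_test))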
Output:
After executing the above code, we will get the Training and test scores as:
As we can see, the training score is about 94% accurate, and the test score is also about 94% accurate.
The difference between the two scores is 0.00149. This is very close to the previous value of
0.0154, obtained when we included all the variables.
ML Polynomial Regression:
o Polynomial Regression is a regression algorithm that models the relationship between
a dependent variable (y) and an independent variable (x) as an nth-degree polynomial. The
Polynomial Regression equation is given below:
Y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ
o In the above image, we have taken a dataset which is arranged non-linearly. So if we
try to cover it with a linear model, then we can clearly see that it hardly covers any
data point. On the other hand, a curve is suitable to cover most of the data points,
which is of the Polynomial model.
o Hence, if the datasets are arranged in a non-linear fashion, then we should use the
Polynomial Regression model instead of Simple Linear Regression.
Comparing the three equations,
Simple Linear Regression: Y = b0 + b1x
Multiple Linear Regression: Y = b0 + b1x1 + b2x2 + ... + bnxn
Polynomial Regression: Y = b0 + b1x + b2x² + ... + bnxⁿ
we can clearly see that all three are polynomial equations but differ in the degree of the variables.
The Simple and Multiple Linear equations are polynomial equations of degree one, and the Polynomial
regression equation is a linear equation of degree n. So if we add higher degrees to our linear
equation, it is converted into a Polynomial Linear equation.
Here we will implement the Polynomial Regression using Python. We will understand it by
comparing Polynomial Regression model with the Simple Linear Regression model. So first,
let's understand the problem for which we are going to build the model.
Problem Description: There is a Human Resources company which is going to hire a new
candidate. The candidate has stated that his previous salary was 160K per annum, and HR has to
check whether he is telling the truth or bluffing. To find this out, they only have a dataset from
his previous company, in which the salaries of the top 10 positions are listed along with their
levels. Looking at the available dataset, we find that there is a non-linear
relationship between the position levels and the salaries. Our goal is to build a bluffing-detector
regression model, so HR can hire an honest candidate. Below are the steps to build
such a model.
Steps for Polynomial Regression:
o Data Pre-processing
o Build a Linear Regression model and fit it to the dataset
o Build a Polynomial Regression model and fit it to the dataset
o Visualize the result for Linear Regression and Polynomial Regression model.
o Predicting the output.
o The data pre-processing step will remain the same as in the previous regression models,
except for some changes. In the Polynomial Regression model, we will not use feature
scaling, and we will also not split our dataset into training and test sets, for two
reasons:
o The dataset contains very few observations, so it is not suitable to divide it into a test
and a training set; otherwise our model will not be able to find the correlations between the
salaries and the levels.
o In this model, we want very accurate predictions for the salary, so the model should have
all the available information.
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('Position_Salaries.csv')
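#Extracting Independent and dependent Variable
# (the extraction lines are not shown in the text; the explanation below refers to the
#  parameters [:, 1:2] for x, i.e. the Levels column kept as a matrix, with Salary assumed as y)
x= data_set.iloc[:, 1:2].values
y= data_set.iloc[:, 2].values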
Explanation:
o In the above lines of code, we have imported the important Python libraries to import
dataset and operate on it.
o Next, we have imported the dataset 'Position_Salaries.csv', which contains three
columns (Position, Levels, and Salary), but we will consider only two columns
(Salary and Levels).
o After that, we have extracted the dependent (y) and independent (x) variables from the
dataset. For the x variable, we have used the parameters [:, 1:2], because we want the
column at index 1 (Levels), and the :2 is included to keep x as a matrix.
Output:
As we can see in the above output, there are three columns (Positions, Levels, and
Salaries). But we are only considering two columns, because the Positions are equivalent to the
Levels, i.e. the Levels can be seen as the encoded form of the Positions.
Here we will predict the output for level 6.5, because the candidate has 4+ years' experience
as a regional manager, so he must be somewhere between levels 6 and 7.
Now, we will build and fit the Linear regression model to the dataset. In building polynomial
regression, we will take the Linear regression model as reference and compare both the results. The
code is given below:
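The code listing is missing here; a minimal version consistent with the lin_regs object described below would be:
#Fitting the Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_regs= LinearRegression()
lin_regs.fit(x, y)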
In the above code, we have created the Simple Linear model using lin_regs object
of LinearRegression class and fitted it to the dataset variables (x and y).
Output:
Now we will build the Polynomial Regression model, but it will be a little different from the
Simple Linear model. Because here we will use PolynomialFeatures class
of preprocessing library. We are using this class to add some extra features to our dataset.
In the above lines of code, we have used poly_regs.fit_transform(x), because first we are
converting our feature matrix into polynomial feature matrix, and then fitting it to the
Polynomial regression model. The parameter value(degree= 2) depends on our choice. We
can choose it according to our Polynomial features.
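The code listing is missing here; a minimal version consistent with the poly_regs, x_poly, and lin_reg_2 names used below would be:
#Fitting the Polynomial regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_regs= PolynomialFeatures(degree= 2)
x_poly= poly_regs.fit_transform(x)
lin_reg_2= LinearRegression()
lin_reg_2.fit(x_poly, y)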
After executing the code, we will get another matrix x_poly, which can be seen under the
variable explorer option:
Output:
Now we will visualize the result for Linear regression model as we did in Simple Linear
Regression. Below is the code for it:
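A minimal version of this visualization code, consistent with the red prediction line and blue observation points described below (the exact titles and labels are a reasonable guess), is:
#Visualizing the result for the Linear Regression model
mtp.scatter(x, y, color="blue")
mtp.plot(x, lin_regs.predict(x), color="red")
mtp.title("Bluff detection model (Linear Regression)")
mtp.xlabel("Position Levels")
mtp.ylabel("Salary")
mtp.show()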
Output:
In the above output image, we can clearly see that the regression line is far from the
datapoints. The predictions are the red straight line, and the blue points are the actual values. If we
use this output to predict the salary of the CEO level, it gives a salary of approximately $600,000,
which is far from the real value.
So we need a curved model, rather than a straight line, to fit the dataset.
Here we will visualize the result of the Polynomial regression model, the code for which is a little
different from the above model.
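A minimal version of the Polynomial regression visualization, consistent with the objects created above (the titles and labels are again a reasonable guess), is:
#Visualizing the result for Polynomial Regression
mtp.scatter(x, y, color="blue")
mtp.plot(x, lin_reg_2.predict(poly_regs.fit_transform(x)), color="red")
mtp.title("Bluff detection model (Polynomial Regression)")
mtp.xlabel("Position Levels")
mtp.ylabel("Salary")
mtp.show()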
Output:
As we can see in the above output image, the predictions are close to the real values. The
above plot will vary as we will change the degree.
For degree= 3:
If we change the degree to 3, we will get a more accurate plot, as shown in the below image.
So, as we can see in the above output image, the predicted salary for level 6.5 is near
$170K-$190K, which suggests the future employee is telling the truth about his salary.
Degree = 4: Let's change the degree to 4 again; now we get the most accurate plot.
Hence, we can get more accurate results by increasing the degree of the polynomial.
Predicting the final result with the Linear Regression model:
Now, we will predict the final output using the Linear regression model to see whether the
employee is telling the truth or bluffing. For this, we will use the predict() method and pass
the value 6.5. Below is the code for it:
lin_pred = lin_regs.predict([[6.5]])
print(lin_pred)
Output:
[330378.78787879]
Now, we will predict the final output using the Polynomial Regression model to compare
with Linear model. Below is the code for it:
poly_pred = lin_reg_2.predict(poly_regs.fit_transform([[6.5]]))
print(poly_pred)
Output:
[158862.45265153]
As we can see, the predicted output for the Polynomial Regression is [158862.45265153],
which is much closer to the real value; hence, we can say that the future employee is telling the truth.