INTRODUCTION TO
DATA SCIENCE, MACHINE
LEARNING & AI
(FOR PRIVATE CIRCULATION ONLY)
2021
PROGRAMME COORDINATOR
Prof. Ankita Mendiratta
COURSE DESIGN AND REVIEW COMMITTEE
Prof. Pallavi Soman Prof. Pallavi Vartak
Prof. Lalit Kathpalia Dr. Sarika Swarup
Prof. Archana Chaudhary Prof. Sophia Gaikwad
Ms. Jaai Kondhalkar Mr. Abhijit Wakhare
Prof. Ankita Mendiratta
COURSE WRITER
Mr. Sudarshan Shidore Ms. Kaurobi Ghosh
Mr. Nitin Patil Ms. Sonali Karale
Prof. Ankita Mendiratta
EDITOR
Mr. Yogesh Bhosle
Published by Symbiosis Centre for Distance Learning (SCDL), Pune
2021
Copyright © 2021 Symbiosis Open Education Society
All rights reserved. No part of this unit may be reproduced, transmitted or utilised in any form or by any
means, electronic or mechanical, including photocopying, recording or by any information storage or retrieval
system without written permission from the publisher.
Acknowledgement
Every attempt has been made to trace the copyright holders of materials reproduced in this unit. Should any
infringement have occurred, SCDL apologises for the same and will be pleased to make the necessary corrections
in future editions of this unit.
PREFACE
This course is designed to enable you to design better, more effective instruction. The word
‘instruction’, as many of you may know already, means education or training or any process of
knowledge transfer (a minor distinction exists between these terms, but for now we may assume that
they are the same).
We shall look at four dimensions of learning – content, learner, context and delivery. Content, of
course, is simply knowledge that needs to be transferred during education or training. The learner is a
student such as yourself, even as you read this. As we shall see later, the modern view is that students
often enhance the process of knowledge transfer. But no content or learner can exist meaningfully
without an underlying context as well. Finally, delivery is the nuts and bolts of instruction. The
rapid pace of technology has meant that instruction is no longer restricted to the mode that is used for
communication. Technology has become so vital at the workplace that there will be times when we
must dip into the ‘technological delivery’ aspect of instruction. This has been done keeping in mind
your current and future job profiles.
The course is also ‘action oriented’. Every once in a while, you will find yourself checking your
understanding or reflecting on an idea. This is intended both to give you the right ‘pace’ of learning
and to build your confidence as you go along.
Before we start on the course, a word of caution: instructional theory, like many theories in the social
sciences (or even the physical or logical sciences), must be applied in the right situation to really
serve its purpose. It is unlikely that you will ever design instruction in which you have used all
the theories or ideas! This is where your judgment as an instructional designer will come in. This is a
bit like cooking, in which you need to ‘balance’ the ingredients to come up with a successful recipe.
So, let us start on our journey. Good luck!
Mr. Sudarshan Shidore
Ms. Kaurobi Ghosh
Mr. Nitin Patil
Ms. Sonali Karale
ABOUT THE AUTHOR
Sudarshan Shidore graduated from IIT Delhi in 1994 with an integrated M.Sc. in Maths
and Computer Applications. In 1996, he received his MBA in Marketing and Systems from
IIM, Ahmedabad.
Sudarshan worked with Leo Burnett Advertising and Infosys Technologies before moving into
the world of e-learning in the year 2000. He has worked with well-known companies such as Tata
Interactive Systems, Magic Software and Learning Mate Solutions, where he headed the courseware
creation team for a year and a half. He has also been a consultant to BrainVisa Technologies and
Harbinger Systems. Clients he has worked for include McGraw Hill, Kaplan University, Allen
Interactions, the University of Phoenix, CIPD (UK), John Wiley and Sons and Pearson Prentice Hall.
Sudarshan has also been a full-time faculty at the prestigious Symbiosis Institute of International
Business between the years 2003 and 2005.
Kaurobi Ghosh is a Training and Content Development Professional and leads her own firm E3
Learning Space which creates learning content for a wide variety of professionals and amateurs.
She completed her BA in Psychology from Fergusson College, Pune and her Masters in Social
Work from TISS, Mumbai.
Previously, she was associated with IL&FS Education & Technology Services and worked on Content
and Curriculum Development to offer end-to-end customised solutions in behavioural skills, business
& entrepreneurship development skills and teachers’ training for various Government institutions
and private corporates across the country.
Mr. Nitin P. Patil has completed his M.Sc. (Computer Science) and has more than 15 years of
experience in industry and teaching at postgraduate level. Mr. Patil is an experienced corporate
trainer and has been guiding students for projects at postgraduate level. He has also published various
research papers in national and international journals and conferences. He is a member of various
committees and a resource person for various refresher courses.
Prof. Sonali Karale has completed her Masters in Computer Applications and is pursuing her Ph.D.
She has rich experience in teaching at Post Graduate level. She has published various research papers
in International and National Conferences.
CONTENTS
Unit No. TITLE
1 Basics of Data Science
1.1 Introduction
1.2 Why Now, Why not Earlier?
1.3 What is Data Science?
1.4 Data Science Composition
1.5 Applications & Case Studies
1.6 Core of Data Science – Machine Learning Algorithms
1.7 The Data Science Lifecycle
Summary
Keywords
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
2 Big Data, Datafication & its Impact on Data Science
2.1 Introduction to Big Data
2.2 Big Data, What is it?
2.3 Big Data & Data Science
2.4 Big Data Technologies
2.5 Datafication
Summary
Keywords
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
3 Data Science Pipeline, EDA & Data Preparation
3.1 Introduction to Data Science Pipeline
3.2 Data Wrangling
3.3 Exploratory Data Analysis
3.4 Data Extraction & Cleansing
3.5 Statistical Modelling
3.6 Data Visualisation
Summary
Keywords
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
4 Data Scientist Toolbox, Applications & Case Studies
4.1 Data Scientist’s Toolbox
4.2 Applications & Case Study of Data Science
Summary
Keywords
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
Annexure
5 Basics of Machine Learning
5.1 Introduction
5.2 Basic Concept of Machine Learning
5.3 Classes of Machine Learning Algorithms
5.4 Deep Learning
5.5 Why use R or Python for Machine Learning?
Summary
Keywords
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
6 Supervised Machine Learning
6.1 Introduction
6.2 Supervised Learning
6.3 Algorithm Types
6.3.1 K-Nearest-Neighbours (KNN) Algorithm
6.3.2 Naïve Bayes Classifier
6.3.3 Decision Tree
6.3.4 Support Vector Machine
Summary
Keywords
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
7 Unsupervised Learning
7.1 Introduction
7.2 Concept of Unsupervised Learning
7.3 Unsupervised Learning Algorithms
Summary
Keywords
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
8 Deep Learning
8.1 Introduction to Deep Learning
8.2 Working/Process of Deep Learning
8.3 Deep Learning and Artificial Neural Networks
8.4 Deep Learning and Artificial Intelligence
8.5 Deep Learning and Machine Learning
8.6 Applications of Deep Learning
Summary
Keywords
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
9 Artificial Intelligence
9.1 Introduction to Artificial Intelligence
9.2 Characteristics of Artificial Intelligence
9.3 Advantages of Artificial Intelligence
9.4 Components of Artificial Intelligence
9.5 Broad Categories of Artificial Intelligence
9.6 Technologies of Artificial Intelligence
9.7 Artificial Intelligence and Machine Learning
9.8 Applications of Artificial Intelligence
Summary
Keywords
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
10 Business Intelligence
10.1 Introduction to Business Intelligence
10.2 Features of Business Intelligence
10.3 Business Intelligence Process
10.4 Factors Contributing to Successful Business Intelligence
10.5 Business Intelligence and Business Analytics
10.6 Business Intelligence and Big Data
10.7 Advantages and Applications of Business Intelligence
Summary
Keywords
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
11 Web Analytics
11.1 Introduction to Web Analytics
11.2 Benefits of Web Analytics
11.3 Process of Web Analytics and Maturity Model of Web Analytics
11.4 Best Practices of Web Analytics
11.5 Web Analytics Tools
11.6 Types of Web Analytics
Keywords
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
Basics of Data Science
UNIT
1
Structure:
1.1 Introduction
1.2 Why Now, Why not earlier?
1.3 What is Data Science?
1.4 Data Science Composition
1.5 Applications & Case Studies
1.6 Core of Data Science – Machine Learning Algorithms
1.7 The Data Science Lifecycle
Summary
Keywords
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
Objectives
After going through this unit, you will be able to:
• Understand what Data Science means
• Understand why we are studying it now
• Describe the applications and uses of Data Science
1.1 INTRODUCTION
----------------------
In a world, where on a single tap we are sending friend requests and
---------------------- buying products on a retail website, unknowingly or knowingly we are creating
---------------------- data at every click. With this, we are enabling the companies collecting this data
to help us give recommendations which are personalized in nature.
----------------------
All of this collected data makes a huge impact on how we see our
---------------------- recommendations. Along with this, nowadays, people don’t even own one
phone. There are multiple phones, tabs, laptops through which he data is
---------------------- collected. For example, if you search for a Data Science course online, you will
see the ads of Data Science courses on Facebook, YouTube etc. This is where
----------------------
huge computer with a lot of processing power, data and data science come into
---------------------- play.
In this unit, we are going to discuss about the meaning of Data Science,
----------------------
why are we studying it now and why didn’t we talk of earlier, what exactly does
---------------------- data science mean and some of the applications of Data Science.
1.2 WHY NOW, WHY NOT EARLIER?
A question rarely asked in the world of Data Science is why we are studying this as a subject now and not earlier. The answer lies in the cost of collecting data. Data Science, as you know, is associated with data. But in the early 90s, the cost of collecting data was humongous. As a result, most corporates and companies did not store data. Even if they did, most of the stored data was in the form of paper trails which were difficult to comprehend. As a result, even if somebody thought of analysing the data, it was very difficult to go back and retrieve records.
But by the late 2000s, the cost of storing data had come down dramatically. As a result, it was easier and cheaper to store data digitally. Because of this, it was much easier to do analysis on the data, pull it up when required, transfer it whenever required etc. In the early 90s, storing data was associated with floppy disks and big hard drives which needed to be physically transported. That scenario has now changed completely. We now have more than 128 GB of space in our mobile phones, pocket-size hard disks exceeding 2 TB and USB drives storing more than 64 GB of data. That is the reason why we can now transform, analyse and visualise data in an easier manner.
Not only was data storage costly during the 90s, large-scale computation on the data was also not possible. The computers present in those days were not capable of processing huge amounts of data. So, as data storage got cheaper, new advanced computers with far greater processing power came into existence. Not only could they store large data, but they could process it quickly too. This also resulted in the discovery of various algorithms which were more powerful, and as a result of this, Data Science came into existence.
1.3 WHAT IS DATA SCIENCE?
Data Science can be defined as an area concerned with collection, preparation, analysis, visualisation, management, modelling and inference building from the large chunks of information present. Although the name seems connected to databases and their management, it is mostly related to business problem solving with the help of statistics and programming. Data Science cannot be defined as one skill, or as just analysis. It is a complete branch in which you have to prepare, analyse and transform data so as to arrive at an actionable insight into the business problem you are trying to solve.
There are a lot of myths about Data Science. One is that Data Scientists are statisticians in white lab coats with PhDs, blinking at a computer screen; another is that they are coders trying to automate a process using some algorithm. Nothing could be further from the truth. The fact is that Data Scientists can be any of these as well as none of these. Sometimes you only need Microsoft Excel and logical analysis to solve the most complex problems, whereas sometimes you might require a very powerful cloud cluster to get an answer. Another misconception about Data Science is that it is just analysis. In fact, Data Science goes much beyond just analysing the data. It is the process of extracting meaningful insights on which some action can be taken.
The people who work with Data Science problems day in and day out are called Data Scientists. The job profile of a Data Scientist can vary. It can range from collecting or storing data to building a complex mathematical prediction model which needs to be deployed into a production environment. A Data Scientist can also have job roles ranging from automating dashboards to building products and applications which work flawlessly. Thus, being a Data Scientist requires command over ETL (Extract, Transform & Load) and BI (Business Intelligence) tools for visualisation, coding command over languages like R, Python, Matlab & SAS, along with knowledge of big data tools like Hadoop, Apache Spark etc.
Some of the skills required for Data Science are:
1) Knowing the business & application domain
A data scientist must know the problem context in terms of the application as well as the business domain. He/she should know how solving the problem will impact the business side of things.
2) Communication & Looking at the Holistic View
A data scientist should be able to clearly communicate the insights & results, along with the actions required, to the business user. They should also have an understanding of the holistic view of the domain and the impact of the recommended change on the business.
3) Data Transformation, Analysis & Visualisation
A data scientist should be able to extract, load and transform data from any source into a usable format. He/she should be able to work with structured & unstructured data so as to find useful insights from it. A data scientist should be able to analyse the data, understand what each and every variable means, and visualise it so as to see it in a snapshot view.
4) Modelling & Presentation
A data scientist should be able to build statistical algorithms on the data at hand. He/she should also be able to present the results in a way which the business user can understand and take actions on.
Thus, Data Science is a field with a plethora of sub-fields to be studied, a lot of languages to be learnt and still a lot of questions to be answered. And all of these questions are generally answered by Data Scientists.
Some of the skills and work ethics required of Data Scientists are:
a) Conduct undirected research and frame open-ended industry questions
b) Extract huge volumes of data from multiple internal and external sources
c) Employ sophisticated analytics programs, machine learning and statistical methods to prepare data for use in predictive and prescriptive modelling
d) Thoroughly clean and prune data to discard irrelevant information
e) Explore and examine data from a variety of angles to determine hidden weaknesses, trends and/or opportunities
f) Form data-driven solutions to the most pressing challenges
g) Invent new algorithms to solve problems and build new tools to automate work
h) Communicate predictions and findings to management and IT departments through effective data visualisations and reports
i) Recommend cost-effective changes to existing procedures and strategies
1.4 DATA SCIENCE COMPOSITION
Data Science mainly comprises four major components:
1) Domain Knowledge Expertise
2) Software Programming
3) Statistics
4) Communication
Figure Source - http://rpubs.com/ptherond/DSc_Venn
Domain Knowledge Expertise
Data science is all about solving business problems. The most imperative step is to understand the impact of a problem on the business. While solving a problem, one needs to understand the intricacies of the particular business domain.
For example, the variables in a financial-domain problem will mean different things than those in a retail-domain problem. Another difference between these domains is the domino effect (changing one thing leading to a change in ‘N’ things following it): a single change has a greater domino effect in the finance domain than in the retail domain. Thus, domain expertise is a very integral part of understanding data science.
Statistics
The most integral part of Data Science is Statistics. When we say statistics, it does not purely mean mathematics. Statistics deals with a host of things, ranging from the simple mean to complex integral and differential calculus problems. The modelling of data, the predictions, the recommendations and every other actionable output in the world of data science depends on statistics. But that does not mean a data scientist needs to know the subject like a statistician. A Data Scientist should be well versed in the basic concepts and algorithms and should be a quick learner, so as to pick up the working of any modelling/algorithmic technique.
Software Programming
Software Programming is again one of the main aspects of the Data Science community. Even if you understand the problem and know which algorithm to apply, you need to know programming concepts so as to execute it. Most of the coding in the world of Data Science is done in R & Python. Other languages such as Scala, Matlab & SAS are also used. One of the upcoming languages which is changing the field of Data Science is ‘Julia’. If one gets a grip on writing good code in any of these languages, then it is pretty easy to learn any new language, as only the syntax changes. Along with this, as R & Python are open-source languages, Data Scientists can get a lot of queries answered by the community.
Communication Skills
Of all the branches of Data Science, communication is one of the most important. It is very important for a Data Scientist to explain the action he/she sees fit for the problem at hand. Not only this, a data scientist's tasks also involve explaining a statistical model to a layman who has little or no knowledge of how algorithms work. This needs to be coupled with the presentation of the workable solution at hand. A data scientist should be able to express his/her approach and ideas for a problem in the least complex manner. He/she also needs to explain why that particular approach was chosen or rejected based on the data at hand. It can be very hard for Data Scientists to explain the processes they have followed in solving a problem, and good communication skills can go a long way in making the conversation with the business much easier.
1.5 APPLICATIONS & CASE STUDIES
As stated by the Harvard Business Review, Data Scientist is the sexiest job of the 21st century. As discussed in the previous sections of this unit, we now know that Data Science has stemmed from multiple disciplines. Data has become the new oil and, as a result, Data Science has many applications. Some of the domains where Data Science has a wide variety of applications are as follows:
1. Banking
2. Retail & E-Commerce
3. Finance
4. Healthcare
Other domains where Data Science is widely used are Transportation, Manufacturing, Telecom & IT.
There are multiple applications of Data Science in each of the sectors mentioned above. Let us walk through one example from each sector.
1) Banking
Banking is one of the biggest sectors with Data Science applications. Coupled with Big Data technologies, Data Science has enabled banks to avoid frauds, manage resources efficiently, personalise customer recommendations and provide real-time analytics. One of the biggest applications of Data Science across banks is risk score calculation, which can be used to avoid frauds. In this application, an algorithm generates a risk score for each and every individual who has taken a loan, just made a transaction or has defaulted on a bill. The algorithm analyses the past history of the individual, their borrowing & lending patterns, investment amount, growth rate etc. After taking all these factors into the picture, the banks are able to get a risk score for a particular customer. This helps them mitigate and prevent frauds by taking corrective actions in the respective directions.
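To make the idea concrete, here is a minimal, purely illustrative sketch of how such a risk score might be produced as a predicted probability. The features and figures below are invented, and scikit-learn is an assumed library choice; real bank models are far more elaborate.

```python
# Hypothetical sketch: a "risk score" as the probability of default
# predicted by a logistic regression. All data below is invented.
from sklearn.linear_model import LogisticRegression

# Features: [past defaults, repayment-pattern score, investment amount]
X = [[0, 0.9, 12], [2, 0.3, 1], [0, 0.8, 8], [3, 0.2, 0.5]]
y = [0, 1, 0, 1]  # 1 = customer later defaulted / turned out fraudulent

model = LogisticRegression().fit(X, y)

new_customer = [[1, 0.5, 4]]
risk_score = model.predict_proba(new_customer)[0][1]
print(round(risk_score, 2))  # higher value = riskier customer
```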
2) Retail & E-Commerce
From Walmart to D-Mart, each and every retail shop is engulfed in the world of Data Science. The introduction of Data Science has revolutionised the retail & e-commerce industry. Transportation, logistics and forecasting are some of the areas where Data Science algorithms have made a huge impact on the industry. Retail & e-commerce industries now have fewer out-of-stock situations, better goods management and personalised recommendation systems due to Data Science. One such example is Amazon, where the home page looks different for every user. This is because the interests of each & every person are different. Similarly, what users have searched on Google, along with their past history, also shapes the recommendations they get. Along with this, retail giants are now able to manage their inventory well because of forecasting algorithms which can predict with very high accuracy the sales of a particular product across a city, region, zip code & state. As a result of all these applications, the retail and e-commerce industries have completely changed.
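As a toy illustration of such sales forecasting, the sketch below uses a simple moving average; the numbers are made up, and real retail forecasters use far more sophisticated models.

```python
# Naive demand forecast: next week's sales estimated as the average of
# the last four weeks. The sales figures are purely illustrative.
import pandas as pd

weekly_sales = pd.Series([120, 135, 128, 150, 160, 158, 170, 182])
forecast = weekly_sales.rolling(window=4).mean().iloc[-1]
print(forecast)  # 167.5 -- a rough estimate for next week
```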
3) Finance
Data Science has played a key role in automating various financial tasks. Just like banks have automated risk analytics, finance industries have also used data science for this task. Financial industries need to automate risk analytics in order to carry out strategic decisions for the company. Using machine learning, they identify, monitor and prioritise risks. These machine learning algorithms enhance cost efficiency and model sustainability through training on the massively available customer data. Similarly, financial institutions use machine learning for predictive analytics. It allows companies to predict customer lifetime value and stock market moves. Nowadays, companies are being built on predicting the next moves in stock markets. Data Science algorithms weigh various factors like news, stock trends, the overall psychology of people, technicals, stock performance and various other such factors to give you a recommendation on whether a stock will move up or down in the long and short term. Thus, the collection of data across various platforms and sources has enabled financial institutions to take data-driven decisions. This in turn has also helped customers, as they get personalised recommendations along with a better quality of experience.
4) Healthcare
Healthcare is one of the leading domains in the world of Data Science. Every day, new experiments and predictions are being made so as to predict heart attacks, cure diseases & mitigate risks. From medical image analysis to genomics, data science is influencing the world of healthcare. As an example, the recently launched Apple Watch can flag the risk of a heart attack up to two days before its onset. This has enabled medical professionals to carry out more research in this field, which will guide them to improve the health conditions of patients. Another application of Data Science in the world of healthcare is drug discovery, in which new candidate medicines are formulated. Drug discovery is a tedious and often complex process. Data Science can help us to simplify this process and provide an early insight into the success rate of a newly discovered drug. With Data Science, we can also analyse several combinations of drugs and their effect on different gene structures to predict the outcome. Thus, Data Science is taking healthcare to new heights, and that has improved patient health in general.
1.6 CORE OF DATA SCIENCE – MACHINE LEARNING ALGORITHMS
The first question that you might be asking yourself is: how do all these projects in Data Science work? The core solution to nearly every problem in Data Science is Machine Learning.
Now this begs the question: what is Machine Learning?
Machine learning is often a big part of a “data science” project; e.g., it is often heavily used for exploratory analysis and discovery (clustering algorithms) and for building predictive models (supervised learning algorithms). However, in data science, you often also worry about the collection, wrangling and cleaning of your data (i.e., data engineering), and eventually, you want to draw conclusions from your data that help you solve a particular problem.
The term machine learning is self-explanatory: machines learn to perform tasks that they aren’t specifically programmed to do. Many techniques are put into practice, like clustering, regression, naive Bayes, etc.
Alpaydin defines Machine Learning as:
“Optimising a performance criterion using example data and past experience”.
Machine Learning is concerned with giving machines the ability to learn by training algorithms on huge amounts of data. It makes use of algorithms & statistical models to perform a task without needing explicit instructions. The name machine learning was coined in 1959 by Arthur Samuel.
Machine learning generally involves two steps:
In the first, a human writes “learning” code that finds patterns in data, identifies which patterns are similar, and reports that similarity (knowledge) in a useful way.
In the second step, the human writes more code that uses that knowledge, so that when new data is encountered, this “predicting” code can anticipate the value of the data or interpret its meaning, in the context of what is already known.
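A minimal sketch of these two steps, assuming scikit-learn as the library and made-up numbers as the data:

```python
# Step 1: "learning" code -- fit a pattern from past data.
from sklearn.linear_model import LinearRegression

X_past = [[1], [2], [3], [4]]  # e.g., advertising spend
y_past = [10, 20, 30, 40]      # e.g., observed sales
model = LinearRegression().fit(X_past, y_past)

# Step 2: "predicting" code -- apply the learned pattern to new data.
print(model.predict([[5]]))   # anticipates a value close to 50
```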
The main objective of machine learning algorithms is to solve certain problems. These “problems” come under the different types of machine learning. Machine learning is divided into three main categories:
1) Supervised Learning
2) Unsupervised Learning
3) Reinforcement Learning
Supervised Learning:
Here, the system is trained using past data (which includes both inputs and outputs), and is able to take decisions or make predictions when new data is encountered.
It is called supervised because there is a teacher or supervisor.
Suppose you are provided with a data set consisting of images of bikes and cars. Now, you need to train the machine to classify all the different images. You can train it like this:
If there are 2 wheels and 1 headlight on the front, it will be labelled as a bike.
If there are 4 wheels and 2 headlights on the front, it will be labelled as a car.
Now, let's say that after training on the data, you show the machine a new, separate image (say, a bike) from the bunch and ask it to identify the image.
Since your machine has already learned these things, it needs to use that knowledge. The machine will classify the image based on the presence or absence of a certain number of wheels and headlights, and would label the image as a bike.
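A hedged sketch of this bike/car example in Python (scikit-learn assumed): the hand-made features (wheels, headlights) stand in for what would really come out of image processing.

```python
from sklearn.tree import DecisionTreeClassifier

# Training data: [number of wheels, number of headlights], with labels.
X = [[2, 1], [2, 1], [4, 2], [4, 2]]
y = ["bike", "bike", "car", "car"]

clf = DecisionTreeClassifier().fit(X, y)

# A new, unseen example with 2 wheels and 1 headlight:
print(clf.predict([[2, 1]]))  # -> ['bike']
```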
Unsupervised Learning:
Here, the system is able to recognise patterns, similarities and anomalies, taking into consideration only the input data.
Unlike supervised learning, the result is not known; we approach the data with little or no knowledge of what the result will be, and the machine is expected to find the hidden patterns and structure in the unlabelled data on its own.
That is why it is called unsupervised: there is no supervisor to teach the machine.
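A minimal unsupervised sketch (k-means via scikit-learn, with illustrative points): no labels are given, yet the algorithm discovers the two natural groups on its own.

```python
from sklearn.cluster import KMeans

X = [[1, 2], [1, 4], [1, 0],      # one natural group
     [10, 2], [10, 4], [10, 0]]   # another natural group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # e.g., [1 1 1 0 0 0] -- two discovered clusters
```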
Reinforcement Learning:
Here, decisions are made by the system on the basis of the reward/punishment it received for the last action it performed.
Reinforcement Learning is an area of Machine Learning in which an agent learns to behave in an environment by performing certain actions and observing the rewards/results which it gets from those actions.
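A toy sketch of this reward-driven loop, assuming a two-action "bandit" environment invented purely for illustration: the agent tries actions, observes rewards and gradually prefers the action that pays more.

```python
import random

values = [0.0, 0.0]  # the agent's estimated value of each action
counts = [0, 0]

for step in range(1000):
    # Explore occasionally; otherwise exploit the best-looking action.
    if random.random() < 0.1:
        a = random.randrange(2)
    else:
        a = values.index(max(values))
    # Invented environment: action 1 yields a higher average reward.
    reward = random.gauss(1.0 if a == 1 else 0.2, 0.1)
    counts[a] += 1
    values[a] += (reward - values[a]) / counts[a]  # running average

print(values)  # the estimate for action 1 ends up clearly higher
```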
1.7 THE DATA SCIENCE LIFE CYCLE
The Data Science Life Cycle consists of:
1) Identify the problem
2) Identify available data sources
3) Identify if additional data sources are needed
4) Statistical analysis
5) Implementation, development
6) Communicate results
7) Maintenance
1) Identify the problem:
- Identify metrics used to measure success over a baseline (doing nothing)
- Identify the type of problem: prototyping, proof of concept, root cause analysis, predictive analytics, prescriptive analytics or machine-to-machine implementation
- Identify key people within your organisation and outside
- Get specifications, requirements, priorities and budgets
- How accurate does the solution need to be?
- Decide what data you need
- Build internally versus using a vendor solution
- Vendor comparison, benchmarking
2) Identify available data sources:
- Extract (or obtain) and check sample data (use sound sampling techniques); discuss fields to make sure the data is understood by you
- Perform EDA (exploratory analysis, data dictionary); a small example follows this list
- Assess the quality of the data, and the value available in the data
- Identify data glitches, find work-arounds
- Are data quality and field population consistent over time?
- Are some fields a blend of different things (example: a keyword field, sometimes equal to the user query, sometimes to the advertiser keyword, with no way to know except via statistical analyses or by talking to business people)?
- How to improve data quality moving forward
- Do I need to create mini summary tables / a database?
- Which tools do I need (R, Excel, Tableau, Python, Perl, SAS)?
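A minimal pandas sketch of the EDA tasks in step 2 above (the file name and columns are hypothetical):

```python
import pandas as pd

df = pd.read_csv("sample_extract.csv")  # hypothetical sample extract

print(df.head())          # eyeball a few records
print(df.dtypes)          # are the field types what we expect?
print(df.describe())      # basic distribution of the numeric fields
print(df.isnull().sum())  # data glitches: missing values per column
```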
3) Identify if additional data sources are needed:
- What fields should be captured?
- How granular?
- How much historical data?
- Do we need real-time data?
- How to store or access the data? (NoSQL? Map-Reduce?)
- Do we need experimental design?
4) Statistical analysis:
- Use imputation methods as needed
- Detect/remove outliers
- Select variables (variable reduction)
- Is the data censored (hidden data, as in survival analysis or time-to-crime statistics)?
- Cross-correlation analysis
- Model selection (as needed, favour simple models)
- Sensitivity analysis
- Cross-validation, model fitting
- Measure accuracy, provide confidence intervals (a few of these tasks are sketched in code after this list)
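A hedged sketch of some of the step-4 tasks just listed: mean imputation, a simple 3-standard-deviation outlier rule and cross-validated accuracy. The dataset and model choice are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)  # stand-in dataset

# Imputation: fill any missing values with the column mean.
X = SimpleImputer(strategy="mean").fit_transform(X)

# Outliers: drop rows more than 3 standard deviations from the mean.
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
keep = (z < 3).all(axis=1)
X, y = X[keep], y[keep]

# Cross-validation: accuracy measured on held-out folds, with a spread.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```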
5) Implementation, development:
- FSSRR: Fast, simple, scalable, robust, re-usable
- How frequently do I need to update lookup tables, white lists, data uploads, and so on?
- Debugging
- Need to create an API to communicate with other apps?
6) Communicating results:
- To whom do I have to communicate the results?
- Is the business person someone who understands statistics or not?
- How does my solution affect the business and what is the impact it can make?
7) Maintenance:
- Is the data shaping up as it did previously?
- Is the statistical model used previously still showing accurate numbers?
- Is there any other way the code can be optimised?
The data science workflow is a non-linear and iterative task which requires many skills and tools to cover the whole process.
From framing your business problem to generating actionable insights, it is a complex process which requires a lot of brainstorming and iterative model-building tasks.
Check your Progress 1
Answer the following.
1. What are the components of Data Science?
2. In which fields does Data Science have its applications?
3. What are the components of the Data Science Life Cycle?
4. What is Machine Learning?
5. What are the components of Machine Learning?
Activity 1
Give one example from your day-to-day life where you see data science algorithms being applied.
Summary
●● Data Science is a combination of multiple fields which involves the creation, preparation, transformation, modelling and visualisation of data.
●● Data Science consists of 4 major components, which are domain knowledge expertise, software programming, statistics & communication skills.
●● About a decade ago, as there was a lot of cost associated with the storage of data, analysis of past data was not possible. Along with this, the lack of computing power in machines also made things a lot harder.
●● Data Science has its applications in each and every field. Right from mobiles to predicting a heart attack, Data Science is being used to create a better living.
●● Data Scientists are the ones who predominantly do most Data Science work. They are integral to modelling and transforming the data.
●● The job of Data Scientists can vary from project to project and from organisation to organisation. They do not have a fixed set of rules under which their work lies.
●● The core part of Data Science is Machine Learning. Machine Learning consists of modelling the data to get some feasible output from it.
●● Machine Learning consists of making the machine learn, or training it, so that it can predict the outcome from previous data points.
●● Machine Learning can be categorised into 3 major parts, which are Supervised, Unsupervised & Reinforcement Learning.
●● Data Science can be described as a work cycle which can be segregated into seven different parts associated with problem identification, data gathering & transformation, statistical modelling, communicating the results, and maintenance.
Keywords
●● Data Science: The art of solving business problems, coupled with statistics & software programming.
●● Machine Learning: The ability of a machine to learn from past data to predict an outcome for the future.
●● Supervised Learning: Learning in which the answers for some of the data points are already known.
●● Unsupervised Learning: Learning in which the answers for the data points are not known, or learning in which there is no supervisor or teacher.
●● Reinforcement Learning: Learning which is based on punishment and reward. A right move leads to a reward and a wrong move leads to punishment.
Self-Assessment Questions
1. What is Data Science & what are its applications?
2. What are algorithms and how are they used in the Data Science field?
3. Why are we studying Data Science as a subject now & why didn't we study it earlier?
4. What is the Data Science Life Cycle?
5. Which types of learning do you see around you? Give one example of each and explain how it affects your day-to-day life.
6. What is Machine Learning?
Answers to Check your Progress
Check your Progress 1
1. The components of Data Science are:
a. Domain Knowledge Expertise
b. Software Programming
c. Statistics
d. Communication Skills
2. Data Science has its applications in:
a. Retail & E-commerce
b. Healthcare & IT
c. Transport, Logistics & Manufacturing
d. Banking & Finance
3. The components of the Data Science Life Cycle are:
a. Identifying the problem
b. Gathering data
c. Statistical analysis
d. Implementation & development
e. Communicating results
f. Maintenance
4. Machine Learning can be defined as:
Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
5. The components of ML are:
a. Supervised Learning
b. Unsupervised Learning
c. Reinforcement Learning
Suggested Reading
1. Jeffrey Stanton, An Introduction to Data Science.
2. Trevor Hastie, Robert Tibshirani and Jerome H. Friedman, The Elements of Statistical Learning.
3. Andriy Burkov, The Hundred-Page Machine Learning Book.
Big Data, Datafication & its Impact on Data Science
UNIT
2
Structure:
2.1 Introduction to Big Data
2.2 Big Data, What is it?
2.3 Big Data & Data Science
2.4 Big Data Technologies
2.5 Datafication
Summary
Keywords
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
Objectives
After going through this unit, you will be able to:
• Understand what is meant by Big Data
• Make connections between Big Data problems & Data Science
2.1 INTRODUCTION TO BIG DATA
While we keep ordering cabs for travel or parcels from Amazon, we never think about how easy it has become for us to sit in our room and do these things. A couple of years ago, this was not the case. Cabs had to be hailed by hand signals on the road, and we ourselves did not trust ordering items from online stores for some reason or another. But as we have evolved, we have gained these comforts because storing and analysing data has become cheaper and faster. Along with this, what has also changed is the amount of data we are generating. Everybody these days has social media accounts, a couple of tablets, a phone, a laptop etc., and we are continuously generating data. As all of these data points get generated, they get stored in parallel as well. Have you ever wondered how Amazon manages and stores the data from all its users? How much storage space would they require? And even if they manage to store such huge amounts of data, won't it take forever to do analysis on it? These are some of the questions that we might ask ourselves, and a decade back the answer to all of them would have been "No, it is not possible". But we have come a long way since then.
The answer to all the above questions is yes, and that is only possible because of Big Data applications and their technological advancements. This then raises the question: what is this Big Data and why is it getting such importance?
"Big Data is a phrase used to mean a massive volume of both structured and unstructured data that is so large it is difficult to process using traditional database and software techniques. In most enterprise scenarios the volume of data is too big or it moves too fast or it exceeds current processing capacity." – Webopedia
Thus, in short, Big Data is nothing but huge amounts of data which our traditional SQL databases cannot handle or store. Moreover, it is not compulsory that all of the data being generated is either structured or unstructured. A Big Data system is capable of handling both plain/random text (unstructured data) and tabular-format data (structured). As a result, this can help companies and conglomerates take data-driven, intelligent decisions.
2.2 BIG DATA, WHAT IS IT?
The figure below is one of the most common figures you will see when you Google Big Data.
Figure Source - https://blog.unbelievable-machine.com/en/what-is-big-data-definition-
Generally, Big Data consists of 5 major V's which help us in understanding, managing and transforming Big Data. These are:
a) Volume
b) Variety
c) Value
d) Velocity
e) Veracity
Let us talk about each one of them briefly, so as to understand what each means and why it is important.
a) Volume
Volume is nothing but the amount of data that needs to be stored. Let us say you are building a product which will capture data from all users using a particular browser, once you have asked for their permission. Suppose there are only 5 users at the start and they slowly start growing. The database which you had used at the start, like Excel or SQL, will not be enough: as the users multiply, the volume of incoming data will be humongous and it will not be possible to store it in the same old/traditional database. This is where volume plays a pivotal role in deciding the technology that needs to be used for the product back end.
b) Variety
The data that we see in Excel, or any other database for that matter, is mostly structured. In the world of Big Data, it can be anything. You can store text, images and voice notes in a database. This creates variety amongst the data types you are going to store. This variety is a driving force in selecting the Big Data technology.
c) Value
Not all the data that is stored will be valuable to us. But does that mean we should not store less valuable data? Sometimes a particular type/column in a dataset might not be useful now, but that does not mean it won't be used in the future. That is the reason why the database should be built in such a way that it is easy to segregate valuable & less valuable data at the source, and easy to combine them when required.
d) Velocity
In today's world, speed is the name of the game. If analytics can be done in real time, it changes a lot of things. Thus, it is imperative to know beforehand the speed at which you need to collect and store data, the timeframe of data capture and the way in which it should be processed. This is the most important part of Big Data.
e) Veracity
Most real-world datasets are not perfect. There are usually a lot of inconsistencies in the data when it is captured. It is important to replace those inconsistencies with suitable data values. You will often find missing values, wrong data types in wrong columns etc. in the real-world data that is captured. You should be able to transform and replace the missing values in the dataset with imputed ones easily.
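As a small illustration of fixing such inconsistencies (the column names and the glitch value are invented for this sketch), pandas makes this kind of imputation straightforward:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [25, np.nan, 40, -1],          # -1 is a capture glitch
    "city": ["Pune", "Pune", None, "Mumbai"],
})

df["age"] = df["age"].replace(-1, np.nan)       # treat glitch as missing
df["age"] = df["age"].fillna(df["age"].mean())  # impute with the mean
df["city"] = df["city"].fillna(df["city"].mode()[0])  # most frequent
print(df)
```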
2.3 BIG DATA & DATA SCIENCE
Data science is an interdisciplinary field that combines statistics, mathematics, data acquisition, data cleansing, mining and programming to extract insights and information. When data sets get so big that they cannot be analysed by traditional data processing application tools, they become 'Big Data'. That massive amount of data is useless if it is not analysed and processed. Hence, big data and data science are inseparable.
Both Data Science and Big Data are related to data-driven decisions but are significantly different.
A data-driven decision (with the expectation of better decisions and increased value) is a process and involves different stages:
1) Capturing of data
2) Processing & storing data
3) Analysing and generating insights
4) Decisions & actions
Big Data is typically involved in processing and storing the data (Step 2), and that too in all scenarios. Big Data technology helps in reducing the cost of processing large volumes of data and also makes feasible a few analyses that were typically not possible earlier.
Data Science is involved in analysing and generating insights (Step 3). It involves using statistical, mathematical and machine learning algorithms on data to generate insights. Whether a data set is "Big Data" or not, we can use Data Science to support data-driven decisions and take better decisions.
For example, if you want to mail your friend a 100 MB file, the mail system will not allow it. So, for the mail system, this file is "Big Data". But if you consider the same file being uploaded to any cloud drive, you would be able to do that easily. Hence, the definition of Big Data changes from system to system.
Big Data is the fuel required by Data Scientists to do Data Science. Some of the technologies which work with Big Data are Hadoop, Apache Spark etc.
2.4 BIG DATA TECHNOLOGIES
Nearly every industry has begun investing in big data analytics, but some are investing more heavily than others. According to IDC, banking, discrete manufacturing, process manufacturing, federal/central government, and professional services are among the biggest spenders. Together those industries were projected to spend $72.4 billion on big data and business analytics in 2017, climbing to $101.5 billion by 2020.

The fastest growth in spending on big data technologies is occurring within banking, healthcare, insurance, securities and investment services, and telecommunications.
It’s noteworthy that three of those industries lie within the financial sector,
which has many particularly strong use cases for big data analytics, such as ----------------------
fraud detection, risk management and customer service optimisation. ----------------------
The list of technology vendors offering big data solutions is seemingly
infinite. ----------------------
Many of the big data solutions that are particularly popular right now fit ----------------------
into one of the following 5 categories:
1) Hadoop Ecosystem

2) Apache Spark

3) Data Lakes

4) NoSQL Databases

5) In-Memory Databases

Let's look at each one in detail.
1) Hadoop Ecosystem
Let us first understand Hadoop itself; the idea of its ecosystem will then follow naturally.

Hadoop is an open-source, scalable and fault-tolerant framework written in Java. It efficiently processes large volumes of data on a cluster of commodity hardware. Hadoop is not only a storage system but a platform for large-scale data storage as well as processing.

Hadoop is an open-source tool from the ASF (Apache Software Foundation). Open source means it is freely available and we can even change its source code as per our requirements: if certain functionality does not fulfil your need, you can change it accordingly. Much of the Hadoop code has been contributed by companies such as Yahoo, IBM, Facebook and Cloudera.

Hadoop provides an efficient framework for running jobs on multiple nodes of a cluster. A cluster is a group of systems connected via a LAN. Apache Hadoop provides parallel processing of data as it works on multiple machines simultaneously.
Hadoop consists of three key parts:

1) Hadoop Distributed File System (HDFS) – the storage layer of Hadoop.

2) Map Reduce – the data processing layer of Hadoop.

3) YARN – the resource management layer of Hadoop.

Let's go into the details of each one of these.

1) HDFS
HDFS is a distributed file system provided in Hadoop as the primary storage service. It is used to store large data sets on multiple nodes. HDFS is deployed on low-cost commodity hardware.

So, if you have ten computers where each computer (node) has a hard drive of 1 TB and you install Hadoop on top of these ten machines, you get a storage capacity of 10 TB in total. This means that you can store a single file of 10 TB in HDFS, and it will be stored in a distributed fashion across these ten machines.

There are many features of HDFS which make it suitable for storing large data, like scalability, data locality, fault tolerance, etc.
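To make the block idea concrete, here is a toy Python sketch (not the real HDFS client) of how a file might be split into fixed-size blocks and replicated across the ten nodes from the example above. The block size and replication factor follow common HDFS defaults, but the placement logic is deliberately simplified.

```python
# A toy illustration (not the real HDFS client) of how HDFS splits a file
# into fixed-size blocks and replicates each block across cluster nodes.
BLOCK_SIZE_MB = 128      # common HDFS default block size
REPLICATION = 3          # common HDFS default replication factor
NODES = [f"node{i}" for i in range(1, 11)]   # the ten machines from the example

def place_blocks(file_size_mb):
    """Return a mapping of block id -> nodes holding a replica."""
    num_blocks = -(-file_size_mb // BLOCK_SIZE_MB)   # ceiling division
    placement = {}
    for b in range(num_blocks):
        # Round-robin placement; real HDFS also considers rack awareness.
        placement[b] = [NODES[(b + r) % len(NODES)] for r in range(REPLICATION)]
    return placement

layout = place_blocks(1024)   # a 1 GB file -> 8 blocks of 128 MB
for block, replicas in layout.items():
    print(f"block {block}: {replicas}")
```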
2) Map Reduce

Map Reduce is the processing layer of Hadoop. The Map Reduce programming model is designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks.

You express your business logic in the way Map Reduce works, and the framework takes care of the rest. The work (complete job) submitted by the user to the master is divided into small pieces of work (tasks) and assigned to slaves.

Map Reduce programs are written in a particular style influenced by functional programming constructs, specifically idioms for processing lists of data. In Map Reduce, we take input as a list and convert it into an output which is again a list. It is the heart of Hadoop; much of Hadoop's power and efficiency comes from Map Reduce's parallel processing.
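The following minimal, single-machine Python sketch illustrates the map, shuffle and reduce phases with the classic word-count example. Real Hadoop runs the same phases in parallel across many nodes, so this illustrates the model, not the Hadoop API.

```python
# A minimal, single-machine sketch of the Map Reduce idea (word count).
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all values emitted for the same key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data needs big storage", "data science needs data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(shuffle_phase(pairs)))
# {'big': 2, 'data': 3, 'needs': 2, 'storage': 1, 'science': 1}
```

Note the list-in, list-out structure: the map phase emits key-value pairs, the shuffle groups them by key, and the reduce phase aggregates each group, exactly the functional idiom described above.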
3) YARN

Apache YARN ('Yet Another Resource Negotiator') is the resource management layer of Hadoop. YARN was introduced in Hadoop 2.x. It allows different data processing engines, such as graph processing, interactive processing, stream processing and batch processing, to run and process data stored in HDFS (Hadoop Distributed File System). Apart from resource management, YARN is also used for job scheduling.

YARN extends the power of Hadoop to other evolving technologies, so they can take advantage of HDFS (a highly reliable and popular storage system) and economical clusters.

Apache YARN is also considered the data operating system for Hadoop 2.x. The YARN-based architecture of Hadoop 2.x provides a general-purpose data processing platform which is not limited to Map Reduce. It enables Hadoop to run purpose-built data processing systems other than Map Reduce, allowing several different frameworks to run on the same hardware where Hadoop is deployed.
Now that we have understood what Hadoop is, let us try to understand the Hadoop ecosystem.

The Hadoop ecosystem refers to the various components of the Apache Hadoop software library, as well as to the accessories and tools provided by the Apache Software Foundation.
[Figure: Components of the Hadoop ecosystem. Source: https://www.oreilly.com/library/view/apache-hiveessentials/9781788995092/e846ea02-6894-45c9-983a-03875076bb5b.xhtml]
The above figure shows the various components of the Hadoop ecosystem. Some of the components are explained as follows:

a) Hive

Apache Hive is an open-source data warehouse system for querying and analysing large datasets stored in Hadoop files. Hive performs three main functions: data summarisation, query processing and analysis.

Hive uses a language called HiveQL (HQL), which is similar to SQL. HiveQL automatically translates SQL-like queries into Map Reduce jobs which execute on Hadoop.
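As a hedged illustration, the sketch below sends a HiveQL query from Python using the PyHive library. The host, port and the `sales` table are assumptions for illustration only; Hive translates the query into jobs that run on Hadoop.

```python
# A sketch of querying Hive from Python with the PyHive client.
# Host, port and the `sales` table are illustrative assumptions.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000)  # assumed HiveServer2
cursor = conn.cursor()

# HiveQL looks like SQL; Hive turns it into jobs that run on Hadoop.
cursor.execute("""
    SELECT store_id, SUM(revenue) AS total_revenue
    FROM sales
    GROUP BY store_id
    ORDER BY total_revenue DESC
    LIMIT 10
""")
for row in cursor.fetchall():
    print(row)
```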
b) Pig

Apache Pig is a high-level language platform for analysing and querying huge datasets that are stored in HDFS.

Pig uses the Pig Latin language, which is very similar to SQL. It loads the data, applies the required filters and dumps the data in the required format. For program execution, Pig requires a Java runtime environment.
c) HBase

Apache HBase is a distributed database designed to store structured data in tables that can have billions of rows and millions of columns. HBase is a scalable, distributed NoSQL database built on top of HDFS, and it provides real-time access to read or write data in HDFS.
d) HCatalog

HCatalog is a table and storage management layer for Hadoop. It supports different components available in Hadoop, like Map Reduce, Hive and Pig, enabling them to easily read and write data from the cluster. HCatalog is a key component of Hive that enables users to store their data in any format and structure.
e) Avro

Avro is a popular data serialisation system. It is an open-source project that provides data serialisation and data exchange services for Hadoop. These services can be used together or independently. Using Avro, big data programs written in different languages can exchange data.

The services mentioned above are the ones generally present in the Hadoop ecosystem. It is not compulsory that each of these technologies will be required every time.

Thus, Hadoop is a very important part of Big Data and, most of it being open source, it can be modified as per requirement.
2) Apache Spark

Apache Spark is an open-source cluster-computing framework.
Apache Spark is a general-purpose, lightning-fast cluster computing system. It provides high-level APIs in Java, Scala, Python and R.

Spark is claimed to be up to 100 times faster than Hadoop Map Reduce for in-memory workloads, and about 10 times faster when accessing data from disk.

Apache Spark originated in 2009 in the UC Berkeley R&D Lab, which later became the AMPLab. It was open-sourced in 2010 under a BSD license. In 2013, Spark was donated to the Apache Software Foundation, where it became a top-level Apache project in 2014. It builds on the ideas of Hadoop Map Reduce and extends the Map Reduce model to efficiently support more types of computations.

Spark can be used along with Map Reduce in the same Hadoop cluster or can be used alone as a processing framework. Spark applications can also run on YARN.
Apache Spark applications can be implemented in Java, R, Python and Scala. However, Scala is often considered the most favourable choice because:

a) It offers great scalability on the JVM

b) Performance achieved using Scala can be better than that of data analysis tools like R or Python

c) It has excellent built-in concurrency support and libraries like Akka, making it easy to build scalable applications

d) A single complex line of code in Scala can replace 20-25 lines of complex Java code

e) Scala is fast and efficient, making it an ideal choice for computationally intensive algorithms
Some important features of Spark are:

1) Real-time: Real-time computation and low latency because of in-memory computation

2) Speed: Up to 100x faster for large-scale data processing

3) Ease of Use: Applications can be written easily in Java, Scala, Python, R and SQL

4) Generality: Combines SQL, streaming and complex analytics

5) Deployment: Easily deployable through Mesos, Hadoop via YARN or Spark's own cluster manager

6) Powerful Caching: Provides powerful caching and disk persistence capabilities
Apache PySpark

PySpark is simply Python for Spark, and is one of the languages Spark supports. Spark is a big data processing platform that provides the capability to process petabyte-scale data. Using PySpark you can write Spark applications to process data and run them on the Spark platform. (AWS, for example, offers EMR, a managed Spark platform.)

Using PySpark you can read data from various file formats like CSV, Parquet and JSON, or from databases, and do analysis on top of it.
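Here is a minimal PySpark sketch, assuming a local Spark installation; the file name and column names are illustrative.

```python
# A minimal PySpark sketch: read a CSV and aggregate it on Spark.
# The file path and column names are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("CustomersByCity").getOrCreate()

# Spark can also read parquet, JSON, or tables from databases.
df = spark.read.csv("customers.csv", header=True, inferSchema=True)

# Transformations are lazy; Spark only computes when an action (show) runs.
df.groupBy("city").agg(F.count("*").alias("customers")).show()

spark.stop()
```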
It is because of such features that Spark is widely preferred in industry these days. Whether start-ups or Fortune 500 companies, all are adopting Apache Spark to build, scale and innovate their applications. Spark has left hardly any area of industry untouched; whether it is finance or entertainment, it is widely used everywhere.
3) Data Lakes

A data lake is a reservoir which can store vast amounts of raw data in its native format. This data can be:

1) Structured data from relational databases (rows and columns),

2) Structured data from NoSQL databases (like MongoDB, Cassandra, etc.),

3) Semi-structured data (CSV, logs, XML, JSON),

4) Unstructured data (emails, documents, PDFs) and binary data (images, audio, video).

The purpose of a data lake, a capacious and agile platform, is to hold all the data of an enterprise on a central platform. With this, we can do comprehensive reporting, visualisation and analytics, and eventually glean deep business insights.
But keep in mind that data lakes and data warehouses are different things. Contrary to a data warehouse, where data is processed before being stored in files and folders, a data lake has a flat architecture, meaning that it stores all the data without any prior processing, reducing the time required for preparation. The data in a data lake is retained in its original format until it is needed.

Data lakes provide agility and flexibility, making it easier to make changes. Though the reason for storing data in a data lake is not predefined, the main objective of building one is to offer an unrefined view of the data to data scientists, whenever needed.

A data lake also supports ingestion, i.e. connectors to get data from different data sources loaded into the lake. Data lake storage is more scalable and cost-efficient, and it allows fast data exploration.
If not designed correctly, a data lake can soon become toxic. Some guiding principles for designing a data lake are:

Data within the data lake is stored in the same format as that of the source. The idea is to store data quickly with minimal processing, to make the process fast and cost-efficient.
Data within the data lake is reconciled with the source every time a new data set is loaded, to ensure that it is a mirror copy of the data in the source.

Data within the data lake is well documented to ensure correct interpretation. A data catalogue and definitions are made available to all authorised users through a convenient channel.

Data within the data lake can be traced back to its source to ensure integrity.

Data within the data lake is secured through a controlled access mechanism. It is generally made available to data analysts and data scientists to explore further.

Data within the data lake is generally large in volume. The idea is to store as much data as possible, without worrying about which data elements are going to be useful and which are not. This enables an exploratory environment, where users can keep looking at more data and build reports or analytical models in an incremental fashion.

Data within the data lake is stored in the form of daily copies, so that previous versions of the data can easily be accessed for exploration. The accumulation of historic data over time enables companies to do trend analysis as well as build machine learning models that learn from previous data to predict outcomes.

Data within the data lake is never deleted.

Data within the data lake is generally stored on open-source big data platforms like Hadoop to ensure minimum storage cost. This also enables very efficient querying and processing of large volumes of data during iterative data exploration and analysis.

Data within the data lake is stored in the format in which it is received from the source, and is not necessarily structured. The idea is to put minimum effort into storing data in the data lake; all effort to organise and decipher the data happens after loading.

Thus, data lakes are now a major part of every enterprise architecture. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analysed to help answer the question.
4) In-Memory Databases

An in-memory database is a data store that primarily uses the main memory of a computer. Since main memory has the fastest access time, data stored there affords the most speed for database applications.

Mainstream databases mostly store data in permanent storage (such as a hard disk or network storage), which increases access time; they are thus not as fast as in-memory databases.

Mission-critical applications which need very fast response times, such as medical and telecom applications, have always relied on in-memory databases. However, the recent development of memory devices that can hold large amounts of data at a very low price has made in-memory databases very attractive for commercial applications as well.

In-memory databases generally store data in proprietary formats. There are several open-source in-memory databases that store data in a 'key-value' format; in that sense, these databases are not similar to traditional relational databases that use SQL.
All properly constructed DBMSs are actually in-memory databases for query purposes at some level, because they really only query data that is in memory, i.e. in their buffer caches. The difference is that a database that claims to be in-memory will always have the entire database resident in memory from start-up, while more traditional databases use a demand-loading scheme, only copying data from permanent storage to memory when it is called for.

So, even if an Oracle, Informix, DB2, PostgreSQL, MySQL or MS SQL Server instance has sufficient memory allocated to it to keep the entire database in memory, the first queries will run slower than later queries, until all of the data has been called for directly by queries or pulled into memory by read-ahead activity.

A true in-memory database system will have a period at start-up when it will either refuse to respond to queries or will suspend them until the entire database can be loaded from storage, after which all queries will be served as quickly as possible.
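As a hedged example, the sketch below uses Redis, a popular open-source in-memory key-value store, through its Python client; it assumes a Redis server running on localhost, and the key names are illustrative.

```python
# A sketch using Redis, an open-source in-memory key-value store,
# via the redis-py client. Assumes a Redis server on localhost:6379.
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# Writes and reads go to main memory, so access is very fast.
r.set("session:42", "active")
print(r.get("session:42"))          # b'active'

# Keys can be given a time-to-live, a common pattern for caches.
r.set("otp:42", "981234", ex=60)    # expires after 60 seconds
```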
5) NoSQL Databases

NoSQL refers to a general class of storage engines that store data in a non-relational format. This is in contrast to a traditional RDBMS, in which data is stored in tables whose rows relate to each other. NoSQL stands for "Not Only SQL" and is not meant as a rejection of traditional databases.

There are different kinds of NoSQL databases for different jobs. They can be categorised broadly into four buckets:

a) Key-Value Stores: Very simple; you define a key for a binary object. It is very common for programmers to store large serialised objects in these kinds of databases. Examples are Cassandra and Oracle NoSQL.
b) Document Stores: Store "documents", also based on a key-value system although more structured. The most common implementation is based on the JSON (JavaScript Object Notation) standard, which can be thought of as similar in structure to XML. Examples are MongoDB and CouchDB.

c) Graph DBs: Store data as "graphs", which allow you to define complex relationships between objects. Very common for something like storing relationships between people in a social network. An example is Neo4j.

d) Column-Oriented: Data is stored in columns rather than rows (a tricky concept to grasp at first). This allows for great compression and for building tables that are very large (hundreds of thousands of columns, billions or trillions of rows). An example is HBase.
In general, NoSQL databases excel when you need something that can both read and write large amounts of data quickly. And since they scale horizontally, just adding more servers tends to improve performance with little effort. Facebook, for example, used a NoSQL store for its Inbox search.

Other examples might be a user's online game profile or storing large amounts of legal documents. An RDBMS is still the best option for handling large numbers of atomic transactions (i.e., we likely won't see things like banking systems or supply chain management systems run on a NoSQL database).

This is also because many NoSQL databases are not ACID compliant (basically, two people looking at the same key might see different values, depending on when the data was accessed).
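As a small, hedged illustration of a document store, the sketch below uses MongoDB through the pymongo client; the server address, database and field names are assumptions for illustration.

```python
# A sketch of a document store using MongoDB's pymongo client.
# Assumes a local MongoDB server; names are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["shop"]

# Documents are JSON-like and need no fixed schema.
db.customers.insert_one({"name": "Asha", "city": "Mumbai", "visits": 3})

# Query by field value, much as you would filter rows in SQL.
print(db.customers.find_one({"city": "Mumbai"}))
```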
NoSQL databases are now used in a very large share of everyday applications. They are a very important component of the overall big data landscape.
2.5 DATAFICATION

Datafication is a new concept which refers to "how we render many aspects of uncontrollable and qualitative factors into a quantified form". In other words, this term represents our ability to collect data about aspects of our lives that have never been quantified before, and to turn it into value, i.e. valuable knowledge.

Let us elaborate on this with an example.

Every time we go to a big store to buy a product, the store staff ask us to fill in a form and get a loyalty card for that store. In earlier days, this practice did not exist. So what changed?
Earlier, when we bought products, there was no traceability associating a product with the person who bought it. But now, as we buy anything by swiping a card associated with that store, the store associates the product with us. This helps them send us offers on our mobile and email.

This is DATAFICATION. Earlier, this data was qualitative in nature and was not being captured; with the introduction of the card system, the store can now capture it for nearly 40% of its customers.
Datafication is not only about the data; it also refers to the process of collecting data, as well as the tools and technologies that support data collection. In the business context, an organisation uses data to monitor processes, support decision-making and plan short- and long-term strategies. Many start-up companies have been established on the promise of big data by extracting value from it. In a few years, no business will be able to operate without exploiting the data available to it, while whole industries may face complete re-engineering.

But keep in mind that datafication is not digitalisation. The latter term describes the process of using digital technologies to restructure our society, businesses and personal lives. It began with the rise of computers and their introduction into organisations. In the following years, new technologies such as the Internet of Things were gradually integrated into our lives and revolutionised them. Datafication represents the next phase of this evolution, in which data production and collection are already a given and society moves to establish processes for extracting valuable knowledge.
Check your Progress 1

Answer the Questions.

1. What are the 5 V's in Big Data?

2. What is the connection between Data Science and Big Data?

3. What are the different technologies associated with Big Data?

4. Which are the main components of Hadoop?

5. What is Apache Spark?

6. What are in-memory databases?

Activity 1

Give one example from your day-to-day life where you see DATAFICATION happening.
Summary

●● Big Data is huge data which is characterised by the 5 V's.

●● Big Data and Data Science go hand in hand, but that does not mean that one cannot exist without the other.

●● The various technologies associated with Big Data are:

Hadoop

Apache Spark

In-Memory Databases

NoSQL Databases

Data Lakes

●● Datafication is a new trend of finding new ways to convert qualitative data about a consumer/customer into quantitative data.
Keywords

●● Big Data: Data that exceeds the processing capacity of conventional database systems.

●● Datafication: The collective tools, technologies and processes used to transform an organisation into a data-driven enterprise.

Self-Assessment Questions

1. How are Data Science and Big Data connected? Explain with an example.

2. Where do you think Big Data can be most useful in today's world?
Answers to Check your Progress

Check your Progress 1

1. The 5 V's of Big Data:

a. Volume

b. Variety

c. Velocity

d. Veracity

e. Value

2. Connection between Data Science and Big Data: Big Data is the fuel for Data Science.
3. Technologies associated with Big Data:

a. Hadoop

b. Spark

c. Data Lakes

d. NoSQL Databases

e. In-Memory Databases

4. Main components of Hadoop:

1) HDFS

2) Map Reduce

3) YARN

5. Apache Spark is a general-purpose, lightning-fast cluster computing platform. It is an open-source, wide-ranging data processing engine.

6. An in-memory database is a database management system that primarily relies on main memory for data storage.
Suggested Reading

1. Big Data: A Revolution That Will Transform How We Live, Work, and Think by Kenneth Cukier and Viktor Mayer-Schönberger.

2. Big Data For Dummies by Alan Nugent, Fern Halper, Judith Hurwitz, and Marcia Kaufman.

3. Big Data at Work: Dispelling the Myths, Uncovering the Opportunities by Thomas H. Davenport.
Data Science Pipeline, EDA & Data Preparation
UNIT
3
Structure:
3.1 Introduction to Data Science Pipeline
3.2 Data Wrangling
3.3 Exploratory Data Analysis
3.4 Data Extraction & Cleansing
3.5 Statistical Modelling
3.6 Data Visualisation
Summary
Key Words
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
Objectives

After going through this unit, you will be able to:

• Understand what is meant by a Data Science Pipeline

• Explain the meaning of Data Wrangling and Exploratory Data Analysis

• Understand why cleansing the data is the most important part of Data Science

• Understand the basics of Statistical Modelling

• Know why visualising the data is an integral part of the Data Science work cycle
3.1 INTRODUCTION TO DATA SCIENCE PIPELINE

A data science pipeline is the overall step-by-step process of obtaining, cleaning, visualising, modelling and interpreting data within a business or group. Data science pipelines are sequences of processing and analysis steps applied to data for a specific purpose.

They are useful in production projects, and they can also be useful if one expects to encounter the same type of business question in the future, so as to save design time and coding.
The stages of a Data Science Pipeline are as follows:

1) Problem Definition

Contrary to common belief, the hardest part of data science isn't building an accurate model or obtaining good, clean data. It is much harder to define feasible problems and come up with reasonable ways of measuring solutions. Problem definition aims at understanding a given problem in depth.

Multiple brainstorming sessions are organised to correctly define a problem, because your end goal depends on what problem you are trying to solve. Hence, if you go wrong during the problem definition phase itself, you will be delivering a solution to a problem which never existed in the first place.
2) Hypothesis Testing

Hypothesis testing is an act in statistics whereby an analyst tests an assumption regarding a population parameter. The methodology employed depends on the nature of the data used and the reason for the analysis.

Hypothesis testing is used to infer the result of a hypothesis performed on sample data from a larger population. In simple words, we form some assumptions during the problem definition phase and then validate those assumptions statistically using data.
3) Data Collection and Processing

Data collection is the process of gathering and measuring information on variables of interest, in an established systematic fashion that enables one to answer stated research questions, test hypotheses, and evaluate outcomes. The data collection component of research is common to all fields of study, including the physical and social sciences, humanities, business, etc. While methods vary by discipline, the emphasis on ensuring accurate and honest collection remains the same.

Data processing is a series of actions or steps performed on data to verify, organise, transform, integrate and extract data in an appropriate output form for subsequent use. Methods of processing must be rigorously documented to ensure the utility and integrity of the data.
4) EDA and Feature Engineering

Once you have clean and transformed data, the next step for machine learning projects is to become intimately familiar with the data using exploratory data analysis (EDA). EDA is about numeric summaries, plots, aggregations, distributions, densities, reviewing all the levels of factor variables and applying general statistical methods. A clear understanding of the data provides the foundation for model selection, i.e. choosing the correct machine learning algorithm to solve your problem.

Feature engineering is the process of determining which predictor variables will contribute the most to the predictive power of a machine learning algorithm. The process of feature engineering is as much an art as a science. Often feature engineering is a give-and-take process with exploratory data analysis, providing much-needed intuition about the data. It is good to have a domain expert around for this process, but it is also good to use your imagination.
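A minimal pandas sketch of these steps, assuming a hypothetical telecom.csv file and columns:

```python
# A minimal EDA sketch with pandas: summaries, factor levels and a
# simple engineered feature. File and column names are assumptions.
import pandas as pd

df = pd.read_csv("telecom.csv")

df.info()                         # column types and missing-value counts
print(df.describe())              # mean, std, quartiles for numeric columns
print(df["plan"].value_counts())  # levels of a factor (categorical) variable

# A simple engineered feature: average revenue per call.
df["revenue_per_call"] = df["monthly_revenue"] / df["calls_per_month"]
```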
5) Modelling and Prediction

Machine learning can be used to make predictions about the future. You provide a model with a collection of training instances, fit the model on this data set, and then apply the model to new instances to make predictions. Predictive modelling is useful because you can make products that adapt based on expected user behaviour. For example, if a viewer consistently watches the same broadcaster on a streaming service, the application can load that channel on application start-up.
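A minimal scikit-learn sketch of this fit-then-predict workflow, using made-up viewing data:

```python
# Fit a model on training instances, then apply it to new instances.
# The numbers here are invented purely for illustration.
from sklearn.linear_model import LinearRegression

# Training instances: hours watched per week -> minutes in the app per day.
X_train = [[1], [3], [5], [7], [9]]
y_train = [10, 28, 52, 70, 95]

model = LinearRegression().fit(X_train, y_train)

# Apply the fitted model to new instances to make predictions.
print(model.predict([[4], [8]]))
```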
6) Data Visualisation

Data visualisation is the process of displaying data/information in graphical charts, figures and bars. It is used as a means to deliver visual reporting to users about the performance, operations or general statistics of data and model predictions.
7) Insight Generation and Implementation

Interpreting the data is about communicating your findings to the interested parties. If you can't explain your findings to someone, then whatever you have done is of no use. Hence, this step is very crucial.

The objective of this step is to first identify the business insight and then correlate it to your data findings. Secondly, you might need to involve domain experts in correlating the findings with business problems. Domain experts can help you in visualising your findings along the business dimensions, which will also aid in communicating facts to a non-technical audience.
3.2 DATA WRANGLING

Data wrangling is often said to be 80% of what a data scientist does, and it is where much of the real value is created.

The first step in analytics is gathering data. Then, as you begin to analyse and dig deep for answers, it often becomes necessary to connect to, and mash up, information from a variety of data sources.
Data can be messy, disorganised, and contain errors. As soon as you start working with it, you will see the need to enrich or expand it, adding groupings and calculations. Sometimes it is difficult to understand what changes have already been made.

Moving back and forth between data wrangling and analytics tools slows the analytics process and can introduce errors. It is important to find a data wrangling function that lets you easily make adjustments to data without leaving your analysis.

Data wrangling is also called data munging. It follows certain steps: after extracting the data from different data sources, the data is sorted using appropriate algorithms, decomposed into a different structured format, and finally stored in another database.
Some of the steps associated with data wrangling are:

1. Load, explore, and analyse your data

2. Drop unnecessary columns, like columns containing IDs, names, etc.

3. Drop columns which contain a lot of null or missing values

4. Impute missing values

5. Replace invalid values

6. Remove outliers

7. Log-transform skewed variables

8. Transform categorical variables to dummy variables

9. Bin the continuous numeric variables

10. Standardisation and normalisation

Each of the above-mentioned steps has a special importance with respect to data science; a short pandas sketch covering several of them is shown below.
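In the sketch, the file and column names are assumptions for illustration, and each commented line maps to one of the numbered steps:

```python
# A hedged pandas sketch covering several wrangling steps from the list
# above; customers.csv and its column names are illustrative.
import numpy as np
import pandas as pd

df = pd.read_csv("customers.csv")

df = df.drop(columns=["customer_id", "name"])        # step 2: drop ID/name columns
df = df.dropna(axis=1, thresh=int(0.5 * len(df)))    # step 3: drop mostly-null columns
df["age"] = df["age"].fillna(df["age"].median())     # step 4: impute missing values
df["income"] = np.log1p(df["income"])                # step 7: log-transform skew
df = pd.get_dummies(df, columns=["city"])            # step 8: dummy variables
df["age_band"] = pd.cut(df["age"], bins=[0, 25, 45, 65, 100])  # step 9: binning
```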
Let us look at an example.

If you want to visualise the number of customers of a telecom provider by city, you need to ensure that there is only one row per city before visualising the data.

If you have two rows like Bombay and Mumbai representing the same city, this could lead to wrong results. One of the values has to be changed by the data analyst; this is done by creating a mapping on the fly in the visualisation tool and applying it to every row of data. The data is then checked for more such issues, and the process is repeated for other cities.
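A small pandas sketch of this standardisation, applying a Bombay-to-Mumbai mapping to every row of an illustrative table:

```python
# Standardise two spellings of the same city before aggregating.
# The DataFrame contents are invented for illustration.
import pandas as pd

df = pd.DataFrame({"city": ["Mumbai", "Bombay", "Pune", "Bombay"],
                   "customers": [120, 45, 80, 30]})

city_map = {"Bombay": "Mumbai"}            # mapping created by the analyst
df["city"] = df["city"].replace(city_map)  # applied to every row

print(df.groupby("city")["customers"].sum())   # one row per city now
```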
Need for Data Wrangling

Data wrangling is an important step in implementing a statistical model. Data is converted to a proper, feasible format before any model is applied to it. By filtering, grouping and selecting appropriate data, the accuracy and performance of the model can be increased.
3.3 EXPLORATORY DATA ANALYSIS

Exploratory data analysis is, as the name suggests, a first look at the data you will be working with. Usually this involves:

1. Cleaning the data – finding junk values and removing them, finding outliers and replacing them appropriately (with the 95th percentile, for example), and so on (see the sketch after this list).

2. Summary statistics – finding the mean, median and, if necessary, mode, along with the standard deviation and variance of the particular distribution.

3. Univariate analysis – a simple histogram that shows the frequency of a particular variable's values, or a line chart that shows how a particular variable changes over time, to have a look at all the variables in the data and understand them.
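A short pandas sketch of points 1 and 2 above, capping an outlier at the 95th percentile and then computing summary statistics on illustrative data:

```python
# Cap extreme outliers at the 95th percentile, then summarise.
# The values are invented for illustration.
import pandas as pd

sales = pd.Series([12, 15, 14, 13, 400, 16, 15, 14, 13, 17])

capped = sales.clip(upper=sales.quantile(0.95))   # replace extreme outliers

print(capped.mean(), capped.median(), capped.std())
print(capped.mode().tolist())
```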
The idea is that, after performing exploratory data analysis, you should have a sound understanding of the data you are about to dive into. Further hypothesis-based analysis (post EDA) could involve statistical testing, bivariate analysis, etc.
Let’s understand this with the help of an example.
----------------------
We all must have seen our mother taking a spoonful of soup to judge
---------------------- whether or not the salt is appropriate in the soup. The act of tasting the soup
to check the salt level and to better understand the taste of soup by taking a
---------------------- spoonful is exploratory data analysis. Based on that our mothers decide the salt
level, this is where they make inferences and their validity depends on whether
----------------------
or not the soup is well stirred that is to say whether or not the sample represents
---------------------- the whole population.
Let us take another business case example,
----------------------
Say we have given some data of sales and their daily revenue numbers for
---------------------- a big retail chain
---------------------- Business problem is – A retail chain wants to improve its revenue.
---------------------- The question that arises now is that what are the ways with which we can
achieve this?
----------------------
What will you look for? Do you know what to look for? Will you
---------------------- immediately run a code to find mean median and mode and other statistics?
The main objective is to understand the data inside out. The first step in any EDA is asking the right questions, the ones we want answers to. If our questions go wrong, the whole EDA goes wrong. So the first step of any EDA is to list down as many questions as you can on a piece of paper.

What are some of the questions that we can ask? For example:

How many stores does the retail company have in total?

Which stores and regions are performing the best and the worst? What are the actual sales across each and every store?

How many stores are selling products below the average?

How many stores exclusively sell the most profitable products? On which days are sales at their maximum?

Do we see seasonal sales across products? Are there any abnormal sales numbers?

These are some of the questions that need to be asked before deciding on the next steps.
Asking such questions draws some very interesting insights out of the data, such as:

1. Listing the outliers and anomalies in our data

2. Identifying the most important variables

3. Understanding the relationships between variables

4. Checking for errors such as missing values or incorrect entries

5. Knowing the data types in the dataset – whether continuous, discrete or categorical

6. Understanding how the data is distributed

7. Testing a hypothesis or checking assumptions related to a specific model

Exploratory data analysis (EDA) is very different from classical statistics. It is not about fitting models, parameter estimation or testing hypotheses; it is about finding information in data and generating ideas. So, this is the background of EDA. Technically, it involves steps like cleaning the data, calculating summary statistics and then making plots to better understand the data at hand and make meaningful inferences.
3.4 DATA EXTRACTION & CLEANSING

Data extraction and cleaning (sometimes also referred to as data cleansing or data scrubbing) is the act of detecting and either removing or correcting corrupt or inaccurate records from a record set, table or database. Used mainly in cleansing databases, the process involves identifying incomplete, incorrect, inaccurate or irrelevant items of data and then replacing, modifying or deleting this "dirty" information.
The next step after data cleaning is data reduction. This includes defining and extracting attributes, decreasing the dimensions of the data, representing the problems to be solved, summarising the data, and selecting portions of the data for analysis.

There are multiple data cleansing practices in vogue to clean and standardise bad data and make it effective, usable and relevant to business needs. Organisations relying heavily on data-driven business strategies need to choose a practice that best fits their way of working. A standard practice is shown in the diagram below.
[Figure: A standard data cleansing practice]
The detailed steps of this process are as follows:

1. Store the data:

Put together the data collected from all sources and create a data warehouse. Once your data is stored in one place, it is ready to be put through the cleansing process.

2. Identify errors:

Multiple problems contribute to lowering the quality of data and making it dirty: inaccuracy, invalid data, incorrect data entry, missing values, spelling errors, incorrect data ranges and multiple representations of the same data. These are some of the common errors which should be taken care of in creating a cleansed data regime.
3. Remove duplication/redundancy:

Multiple employees may work on a single file where they collect and enter data. Most of the time, they don't realise they are entering the same data already collected by some other employee at some other time. Such duplicate data corrupts the results and must be weeded out.

4. Validate the accuracy of data:

Effective marketing occurs with high-quality data, and thus validating accuracy is the top priority organisations aim for. However, the method of collection is independent of the cleansing process. A triple verification of the data will enhance the dataset and build trustworthiness, allowing marketers and sales professionals to utilise the power of data.

5. Standardise the data format:

Now that the data is validated, it is important to put all of it in a standardised and accessible format. This ensures the entered data is clean and enriched, ready to use.
Some of the other best practices which should be followed while cleansing data are:

Sort data by different attributes.

For large datasets, cleanse stepwise and improve the data with each step until you achieve good data quality.

For large datasets, break them into smaller chunks. Working with less data will increase your iteration speed.

To handle common cleansing tasks, create a set of utility functions/tools/scripts. These might include remapping values based on a CSV file or SQL database, regex search-and-replace, or blanking out all values that don't match a regex.

If you have an issue with data cleanliness, arrange the problems by estimated frequency and attack the most common ones first.

Analyse the summary statistics for each column (standard deviation, mean, number of missing values).

Keep track of every data cleaning operation, so you can alter or remove operations if required.
But keep in mind that all these are standard practices, and they may or may not apply to a given problem. For example, if we have numerical data, we might first want to deal with missing values and NAs. For textual data, tokenisation, removing whitespace, punctuation and stopwords, and stemming are all possible steps towards cleaning the data for further analysis, as the sketch below illustrates.
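A minimal pure-Python sketch of those textual steps (tokenisation, punctuation removal and stopword removal; stemming is left out), using a tiny illustrative stopword list:

```python
# Tokenise, strip punctuation and remove stopwords from raw text.
import re

STOPWORDS = {"the", "is", "a", "of", "and"}   # a tiny illustrative list

def clean_text(text):
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", " ", text)      # drop punctuation
    tokens = text.split()                     # tokenise on whitespace
    return [t for t in tokens if t not in STOPWORDS]

print(clean_text("  The quality of data, and the analysis, is key!  "))
# ['quality', 'data', 'analysis', 'key']
```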
Thus, data cleansing is imperative for model building. If the data is garbage, then the output will also be garbage, no matter how great the statistical analysis applied to it.
3.5 STATISTICAL MODELLING

In simple terms, statistical modelling is a simplified, mathematically formalised way to approximate reality (i.e. whatever generates your data) and, optionally, to make predictions from this approximation. The statistical model is the mathematical equation that is used.

Statistical modelling is, literally, building statistical models. A linear regression is a statistical model.
To do any kind of statistical modelling, it is essential to know the basics of statistics, such as:

Basic statistics: mean, median, mode, variance, standard deviation, percentiles, etc.

Probability distributions: geometric, binomial, Poisson, normal, etc.

Population and sample: the basic concepts and the idea of sampling

Confidence intervals and hypothesis testing: how to perform validation analysis

Correlation and regression analysis: basic models for general data analysis

Statistical modelling is a step which comes after data cleansing. The most important parts are model selection, configuration, prediction, evaluation and presentation. Let us look at each one of these in brief.
1) Model Selection

One among many machine learning algorithms may be appropriate for a given predictive modelling problem. The process of selecting one method as the solution is called model selection. This may involve a suite of criteria from stakeholders in the project, as well as the careful interpretation of the estimated skill of the methods evaluated for the problem.

As with model configuration, two classes of statistical methods can be used to interpret the estimated skill of different models for the purposes of model selection:

Statistical hypothesis tests: methods that quantify the likelihood of observing the result given an assumption or expectation about the result (presented using critical values and p-values).

Estimation statistics: methods that quantify the uncertainty of a result using confidence intervals.
2) Model Configuration

A given machine learning algorithm often has a suite of hyperparameters (parameters passed to the statistical model which can be changed) that allow the learning method to be tailored to a specific problem. The configuration of the hyperparameters is often empirical in nature, rather than analytical, requiring large suites of experiments in order to evaluate the effect of different hyperparameter values on the skill of the model.

Hyperparameters are the ones which can make or break a model, and hyperparameter tuning is a very common practice in the world of data science. The two usual methods of hyperparameter tuning are grid search and random search.
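A hedged grid-search sketch with scikit-learn, using the bundled iris dataset and an SVM purely for illustration: every hyperparameter combination in the grid is scored by cross-validation and the best is kept.

```python
# Grid search: try each hyperparameter combination and keep the best.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)   # 5-fold scoring per combo
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

Random search works the same way but samples combinations at random, which is often cheaper when the grid is large.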
3) Model Evaluation

A crucial part of a predictive modelling problem is evaluating a learning method. This often requires estimating the skill of the model when making predictions on data not seen during training.

Generally, the planning of this process of training and evaluating a predictive model is called experimental design, a whole subfield of statistical methods. Experimental design covers methods to design systematic experiments that compare the effect of independent variables on an outcome, such as the choice of machine learning algorithm on prediction accuracy.

As part of implementing an experimental design, resampling methods are used to make economic use of the available data when estimating the skill of the model. Resampling methods systematically split a dataset into subsets for the purposes of training and evaluating a predictive model.
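A small scikit-learn sketch of one such resampling method, k-fold cross-validation, on the bundled iris dataset:

```python
# k-fold cross-validation: split the data into k subsets so the model
# is always evaluated on data it was not trained on.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())   # estimated skill and its spread
```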
4) Model Presentation

Once a final model has been trained, it can be presented to stakeholders prior to being used or deployed to make actual predictions on real data. Part of presenting a final model involves presenting its estimated skill.

Methods from the field of estimation statistics can be used to quantify the uncertainty in the estimated skill of the machine learning model through the use of tolerance intervals and confidence intervals. Estimation statistics, again, are methods that quantify the uncertainty in the skill of a model via confidence intervals.
3.6 DATA VISUALISATION
Data Visualisation is the representation of information in the form of charts, diagrams, pictures and other visual forms.
Importance of Data Visualisation:
Absorb information quickly
Understand your next steps
Connect the dots
Hold your audience longer
Reduce the need for specialist analysis of every question
Share your insights with everyone
Find the outliers
Memorise the important insights
Act on your findings quickly
The key elements of successful data visualisation include:
It tells a visual story
It's easy to understand
It's tailored for your target audience
It's user friendly
It's useful
It's honest
It's succinct
It provides context
Data science is useless if you can't communicate your findings to others, and visualisations are imperative if you're speaking to a non-technical audience. If you come into a board room without presenting any visuals, you're going to run out of work pretty soon.
More than that, visualisations are very helpful for data scientists themselves. Visual representations are much more intuitive to grasp than numerical abstractions.
Let's consider an example. The chart below shows total air passengers over time for a particular airline.
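The figure itself is not reproduced here, but a chart of this kind can be drawn in a few lines; a minimal sketch using pandas and Matplotlib, in which the file name and column names are assumptions made for illustration:

    import pandas as pd
    import matplotlib.pyplot as plt

    # 'air_passengers.csv' is a hypothetical file with 'month' and 'passengers' columns.
    df = pd.read_csv("air_passengers.csv", parse_dates=["month"])
    df.plot(x="month", y="passengers", legend=False)
    plt.xlabel("Month")
    plt.ylabel("Total passengers")
    plt.title("Total air passengers over time")
    plt.show()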
Just by glancing at the chart for two seconds, we immediately recognise a seasonal trend and a long-term trend. Identifying those patterns by analysing the numbers alone would require decomposing the signal in several steps.
Thus, you require visualisations in two places:
You need to understand the data yourself, so you create exploratory visualisations which will probably never be shared.
You need to get the data's story across, and visualisation is usually the best way to do so.
Visualisations are helpful in both the pre-processing and post-processing stages. They help us understand our datasets and results in the form of shapes and objects, which the human brain grasps far more readily.
What is the future of data visualisation?
There are currently three key trends that are likely to shape the future of data visualisation: Interactivity, Automation, and Storytelling (including VR).
1) Interactivity
Interactivity has been a key element of online Data Visualisation for many years, but it is now beginning to overtake static visualisation as the predominant manner in which visualisations are presented, particularly in news media. It is increasingly expected that every online map, chart and graph will be both interactive and animated.
The challenge of interactivity is to provide choices that accommodate an extensive range of users and their corresponding needs, without overcomplicating the user interface of the visualisation. There are 7 key types of interactivity:
Select (choosing features)
Explore
Reconfigure
Encode
Abstract/Elaborate
Filter
Connect
2) Automation
In the past, Data Visualisation was a tedious and difficult process. The current challenge is to automate Big Data visualisation so that big-picture trends are surfaced without losing sight of the details of interest.
Best-practice visualisation and design standards are vital, but there should also be a match between the kind of visualisation and the purpose for which it will be used.
3) Storytelling and VR
Storytelling with data is popular, and rightfully so. Data Visualisations are devoid of meaning without a story, and stories can be enormously enhanced when supplemented with data visualisation.
The future of storytelling might be virtual reality. The human visual system is optimised for seeing and interacting in three dimensions, so the full storytelling potential of data visualisation can be explored once it is no longer confined to flat screens.
Some of the best Data Visualisation tools for Data Science are:
1) Tableau
2) QlikView
3) Power BI
4) Qlik Sense
5) FusionCharts
6) Highcharts
7) Plotly
But the most important one if you are working with R is ggplot2; its Python counterparts are Seaborn and Matplotlib.
Let us now discuss ggplot2 in a little more detail.
What is ggplot2?
ggplot2 is a data visualisation package for the statistical programming language R, which tries to take the good parts of base and lattice graphics and none of the bad parts.
It takes care of many of the fiddly details that make plotting a hassle (like drawing legends), as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.
The 5 main reasons why you should explore ggplot2 are as follows:
It can do both quick-and-dirty and complex plots, so you only need one system.
The default colours and other aesthetics are nicer.
You never again lose an axis title (or get told your pdf can't be created) due to wrongly specified outer or inner margins.
You can save plots (or the beginnings of a plot) as objects.
Multivariate exploration is greatly simplified through faceting and colouring.
Data Visualisation will change the manner in which analysts work with data. They will be expected to respond to issues more quickly, and will be required to dig for more insights, looking at information differently and more creatively. Data Visualisation will advance that imaginative data analysis.
Check your Progress 1
Answer the questions.
1. What are the components of the Data Science Pipeline?
2. Name some Data Visualisation tools.
3. What are the four steps involved in model building?
4. What is EDA?
5. What is Data Wrangling?

Activity 1
Find and list more data visualisation tools.
Summary
●● Data Science is a combination of multiple fields which involves creation, preparation, transformation, modelling, and visualisation of data.
●● The Data Science pipeline consists of Data Wrangling, Data Cleansing & Extraction, EDA, Statistical Model Building, and Data Visualisation.
●● Data Wrangling is a step in which the data needs to be transformed and aggregated into a usable format through which insights can be derived.
●● Data Cleansing is an important step in which the data needs to be cleansed, for example by replacing missing values, replacing NaN's in the data, and removing outliers, along with standardisation and normalisation.
●● Data Visualisation is a process of visualising the data so as to derive insights from it at a glance.
●● It is also used to present the results of the data science problem.
●● Statistical modelling is the core of a Data Science solution. It is the fitting of statistical equations on the data at hand to predict a certain value for future observations.
Keywords
●● Data Science Pipeline: The 7 major stages of solving a Data Science problem.
●● Data Wrangling: The art of transforming the data into a format from which it is easier to draw insights.
●● Data Cleansing: The process of cleaning the data of missing values, garbage, NaN's and outliers.
●● Data Visualisation: The art of building graphs and charts so as to understand data easily and find insights in it.
●● Statistical Modelling: The implementation of statistical equations on existing data.

Self-Assessment Questions
1. What is the Data Science Pipeline?
2. Why is there a need for Data Wrangling?
3. What are the steps involved in Data Cleansing?
4. What are the basics required to perform statistical modelling?
5. What do you mean by Data Visualisation and where is it used?
Answers to Check your Progress
Check your Progress 1
1) Components of the Data Science Pipeline are:
a. Identifying the problem
b. Hypothesis testing
c. Data collection & data wrangling
d. EDA
e. Statistical Modelling
f. Interpreting and communicating results
g. Data Visualisation and Insight Generation
2) Some Data Visualisation tools are:
a. Tableau
b. Power BI
c. R & Python
d. QlikView and Qlik Sense
3) The 4 steps involved in model building are:
a. Model selection
b. Model configuration
c. Model evaluation
d. Model presentation
4) EDA is exploratory data analysis, which refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations.
5) Data wrangling is the process of cleaning and unifying messy and complex data sets for easy access and analysis.
Suggested Reading
1. Jeffrey Stanton, An Introduction to Data Science.
2. Field Cady, The Data Science Handbook.
3. Frank Kane, Hands-On Data Science and Python Machine Learning.
4. Data Science in Practice.
Data Scientist Toolbox, Applications & Case Studies
UNIT
4
Structure:
4.1 Data Scientist’s Toolbox
4.2 Applications & Case Study of Data Science
Summary
Key Words
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
Objectives
After going through this unit, you will be able to:
• Understand the tools inside the Data Scientist's Toolbox
• Know different applications of Data Science
• Understand how a Data Science Lifecycle works
4.1 DATA SCIENTIST'S TOOLBOX
Data scientists are responsible for discovering insights from massive amounts of structured and unstructured data to help shape or meet specific business needs and goals. The data scientist role is becoming increasingly important as businesses rely more heavily on data analytics to drive decision-making and lean on automation and machine learning as core components of their IT strategies.
A data scientist's approach to data analysis depends on their industry and the specific needs of the business or department they are working for.
Before a data scientist can find meaning in structured or unstructured data, business leaders and department managers must communicate what they're looking for. As such, a data scientist must have enough business domain expertise to translate company or departmental goals into data-based deliverables such as prediction engines, pattern detection analysis, optimisation algorithms, and the like.
A Data Scientist's toolbox is one of a kind, and it can vary from organisation to organisation. A set of general-purpose tools that are commonly used is as follows:
a) R
b) Python
c) SQL
d) Tableau
e) Power BI
f) Hadoop
g) TensorFlow
h) Apache Spark
i) Statistics
Let's go through each one in detail.
1) R Programming
The R programming language includes a set of functions that support machine learning algorithms: linear regression, classification, statistical inference and so on. The best algorithms for machine learning can be implemented in R.
Today, R is not only used by academics; most large companies also use the R programming language, including Google, Facebook, YouTube and Amazon.
R is freely available under the GNU General Public License. It offers well-organised data analytics capabilities and, most importantly, it has an active online community of users to turn to for support. R is the first choice in the healthcare industry, followed by government and consulting.
R is specifically designed for Data Science needs, and Data Scientists use R to solve statistical problems. Note, however, that R has a steep learning curve.
Some of the aspects of R programming are:
The style of coding is quite easy.
It's open source; there is no need to pay any subscription charges.
The community support is overwhelming; there are numerous forums to help you out.
You get high-performance computing experience.
It is one of the skills most sought by analytics & Data Science companies.
It provides a statistical analysis environment with a huge number of packages available.
It has strong charting capabilities.
R has some features that are important for Data Science applications:
1) R, being a vector language, can perform many operations at once.
2) R doesn't need a compiler, as it is an interpreted language.
3) For statistical analysis and graphs, there is no better option than R, with capabilities such as matrix multiplication available straight out of the box.
4) R provides support functions for Data Science applications.
5) The ability of R to translate maths to code seamlessly makes it an ideal choice for someone with minimal programming knowledge.
Why is R important for data science?
You can run your code without any compiler - R is an interpreted language, so code runs without a separate compilation step; R interprets the code, which makes development easier.
Many calculations are done with vectors - R is a vector language, so anyone can apply a function to a whole vector without writing a loop; vectorised R code is therefore concise and often fast.
Statistical language - R is used in biology and genetics as well as in statistics, and it is a Turing-complete language in which any type of task can be performed.
For example, if you are interested in calculating the average of 10 numbers, you would probably write a for loop, calculate the total, maintain a counter, and then compute the average. So, you deal with the numbers one by one. Similarly, if you are interested in applying a common formula to a set of numbers, you would probably store the numbers in an array, use a for loop, apply the operation to each number, and obtain the result.
R, conversely, operates on vectors. This is critical: you have to think in vectors. The first example above can be solved with the single function call mean(numbers). You are ideally not looking at each number anymore; you are looking at a set of numbers, i.e. a vector, and performing one operation on it. This is called vectorisation.
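The same vectorised thinking carries over to Python's NumPy library (in R itself, the whole computation is simply mean(numbers)); a minimal sketch contrasting the two approaches described above:

    import numpy as np

    numbers = [4, 8, 15, 16, 23, 42, 7, 11, 3, 9]

    # Loop approach: deal with the numbers one by one.
    total = 0
    for x in numbers:
        total += x
    loop_mean = total / len(numbers)

    # Vectorised approach: operate on the whole set at once.
    vector_mean = np.array(numbers).mean()
    print(loop_mean, vector_mean)  # both print 13.8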
As statisticians always deal with sets of data points, they developed a statistical package, R, which fundamentally operates on vectors; R is optimised for vectorised operations. Many statistical operations are therefore built into R, and with this design philosophy, handling datasets is very natural in R. Other competitive benefits of using R for data analysis:
Over 6,000 packages on CRAN, spread across various domains of study.
Strong support on Stack Overflow and good documentation, reducing the learning curve for beginners.
Availability of almost all machine learning packages.
An incredible plotting system (ggplot2).
2) Python Programming
Python is a popular open source programming language and one of the most-used languages in artificial intelligence and related scientific fields.
Machine learning (ML), in turn, is the field of artificial intelligence that uses algorithms to learn from data and make predictions; machine learning helps predict the world around us. From self-driving cars to stock market predictions to online learning, machine learning is used in almost every field that utilises prediction as a way to improve itself. Due to its practical usage, it is one of the most in-demand skills in the job market right now. You don't have to be a data scientist to be fascinated by the world of machine learning, but a few travel guides might help you navigate the vast universe that also includes big data, artificial intelligence, and deep learning, along with a large dose of statistics and analytics.
Also, getting started with Python and machine learning is easy, as there are plenty of online resources and lots of Python machine learning libraries available.
For example, some libraries in Python are:
Theano
Released nearly a decade ago and primarily developed by a machine learning group at Université de Montréal, it is one of the most-used CPU and GPU mathematical compilers in the machine learning community.
TensorFlow
An open source library for numerical computing using data flow graphs. A relative newcomer to the world of open source, this Google-led project already has almost 15,000 commits and more than 600 contributors on GitHub, and nearly 12,000 stars on its models repository.
Scikit-learn
A free machine learning library for the Python programming language. It features various classification, regression and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to inter-operate with the Python numerical and scientific libraries NumPy and SciPy.
Pandas
Pandas contains high-level data structures and manipulation tools that make data analysis faster and easier in Python.
Seaborn
Seaborn is a statistical plotting library in Python. Whenever you're using Python for data science, you will be using Matplotlib (for 2D visualisations) and Seaborn, which has beautiful default styles and a high-level interface for drawing statistical graphics.
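A minimal sketch showing pandas and Seaborn working together; the 'tips' dataset ships with Seaborn, so the example is self-contained:

    import seaborn as sns
    import matplotlib.pyplot as plt

    tips = sns.load_dataset("tips")                      # a pandas DataFrame
    print(tips.groupby("day")["total_bill"].mean())      # pandas manipulation
    sns.scatterplot(data=tips, x="total_bill", y="tip")  # statistical plot
    plt.show()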
The reasons why Data Scientists work with Python are:
Python is a free, flexible and powerful open source language.
Python cuts development time in half with its simple, easy-to-read syntax.
With Python, you can perform data manipulation, analysis, and visualisation.
Python provides powerful libraries for machine learning applications and other scientific computations.
Fundamentals of Python Programming:
Variables
Variables refer to reserved memory locations that store values. In Python, you don't need to declare variables before using them, or even declare their type.
Data Types
Python supports numerous data types, which define the operations possible on the variables and the storage method. The list of data types includes Numeric, Lists, Strings, Tuples, Sets and Dictionaries.
Operators
Operators help to manipulate the values of operands. The list of operators in Python includes Arithmetic, Comparison, Assignment, Logical, Bitwise, Membership, and Identity operators.
Conditional Statements
Conditional statements help to execute a set of statements based on a condition. There are three conditional statements: if, elif, and else.
Loops
Loops are used to iterate through small pieces of code. There are three kinds of loops: while loops, for loops, and nested loops.
Functions
Functions are used to divide your code into useful blocks, allowing you to order the code, make it more readable, reuse it and save time.
Thus, Python for data science is a step-by-step process; the sketch below touches each of these fundamentals.
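A minimal sketch of the fundamentals named above (all values are made up purely for illustration):

    price = 49.5                  # variable: no declaration or type needed
    items = ["pen", "book"]       # data type: a list of strings
    total = price * len(items)    # operators: arithmetic

    if total > 50:                # conditional statement
        label = "large order"
    else:
        label = "small order"

    for item in items:            # loop
        print(item)

    def apply_discount(amount, rate=0.1):   # function
        return amount * (1 - rate)

    print(label, apply_discount(total))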
3) SQL
SQL is for storing, exporting, and importing data. Data scientists frequently need to retrieve data from a SQL data store, such as SQL Server, to complete an analytics task. Additionally, data scientists may need to store results created in an external package, such as SAS, Python, or R, in a SQL Server database for documentation and subsequent re-use. To perform these kinds of functions, you need to know how to obtain metadata about the contents of a SQL data store, how to query data in a SQL data store, how to add new data structures to an existing database, and finally how to update and insert data into a SQL database.
While SQL is all about data, it also permits program development, such as data mining based on statistical techniques, or modelling based on artificial intelligence rules and/or technical indicators.
SQL is a critical tool in data science, mostly for preparing and extracting datasets. It's easy to get "book knowledge" of SQL, such as understanding the parts and meaning of each section of a SELECT query, but that's entirely different from actually being able to solve real-world problems with realistic data.
"SQL basics" is a broad term which covers all the fundamental logic of SQL. It means you must learn about RDBMS, triggers, data types, the relational model, and commands in SQL. Let's have a look at the basics of SQL.
1. SQL Statements - Basically, SQL has 4 types of statements, categorised as:
I) DML (Data Manipulation Language) - Here you will learn the SELECT, INSERT, UPDATE and DELETE statements. The SELECT statement is used to select rows or records. With the INSERT statement, we insert a set of values into a table. The UPDATE statement is used to update the values in a table, and finally the DELETE statement deletes records from a table.
II) DDL (Data Definition Language) - These statements make changes in the system catalogue tables. This covers 3 different SQL statements - CREATE, ALTER, and DROP. We use these 3 statements to create a new table; to add, rename or alter a column or table; and to drop (remove) the data, indexes, triggers, and constraints of a table.
III) DCL (Data Control Language) - These statements control access to the data in the database. Here we use two commands, GRANT and REVOKE. GRANT allows a specific task to a specific user; REVOKE cancels granted permissions.
IV) TCL (Transaction Control Language) - These statements manage database transactions; in other words, they manage the changes made by DML statements. We use 3 statements to do this: COMMIT, to save a transaction permanently; ROLLBACK, to restore the database; and SAVEPOINT, to temporarily mark a point within a transaction.
2. SQL Operators
Operators perform an operation on one or two values or expressions. SQL has mainly 5 kinds of operators - Arithmetic, Bitwise, Compound, Logical, and Comparison operators.
3. SQL Joins
Joins in SQL are used to combine records from tables, based on a related column. SQL has 4 different types of joins - INNER, LEFT, RIGHT and FULL joins.
4. SQL Subquery
A subquery is also called a nested or inner query. It is a query embedded in another query, typically within a WHERE clause. We can use subqueries with SELECT, INSERT, UPDATE, and DELETE statements.
5. SQL String Functions
We use string functions for string manipulation. Some major SQL string functions are ASCII(), BIN(), CHAR(), CHARACTER_LENGTH(), CONCAT() and ELT().
Thus, every Data Scientist must know the basics of SQL for data preparation & transformation. The sketch below shows a few of these statements in action.
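A minimal sketch running DDL, DML and TCL statements from Python's built-in sqlite3 module against an in-memory database; the table and values are made up for illustration, and note that sqlite3 does not support the DCL statements GRANT and REVOKE:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE sales (product TEXT, units INTEGER)")   # DDL
    cur.execute("INSERT INTO sales VALUES ('pen', 10)")               # DML
    cur.execute("UPDATE sales SET units = 12 WHERE product = 'pen'")  # DML
    print(cur.execute("SELECT * FROM sales").fetchall())              # DML
    conn.commit()   # TCL: save the transaction permanently
    conn.close()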
4) Hadoop
Hadoop is an ecosystem of open source components that fundamentally changes the way enterprises store, process, and analyse data.
Unlike traditional systems, Hadoop enables multiple types of analytic workloads to run on the same data, at the same time, at massive scale, on industry-standard hardware. CDH, Cloudera's open source platform, is the most popular distribution of Hadoop and related projects in the world (with support available via a Cloudera Enterprise subscription).
Apache Hadoop is an open source software framework for the storage and large-scale processing of data-sets on clusters of commodity hardware. Hadoop is an Apache top-level project being built and used by a global community of contributors and users. It is licensed under the Apache License 2.0.
The Apache Hadoop framework is composed of the following modules:
Hadoop Common - contains libraries and utilities needed by other Hadoop modules.
Hadoop Distributed File System (HDFS) - a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
Hadoop YARN - a resource-management platform responsible for managing compute resources in clusters and using them to schedule users' applications.
Hadoop MapReduce - a programming model for large-scale data processing.
All the modules in Hadoop are designed with the fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework. Apache Hadoop's MapReduce and HDFS components were originally derived from Google's MapReduce and Google File System (GFS) papers, respectively.
Beyond HDFS, YARN, and MapReduce, the entire Apache Hadoop "platform" is now commonly considered to consist of a number of related projects as well: Apache Pig, Apache Hive, Apache HBase, and others.
HDFS and MapReduce
There are two primary components at the core of Apache Hadoop 1.x: the Hadoop Distributed File System (HDFS) and the MapReduce parallel processing framework. These are both open source projects, inspired by technologies created inside Google.
HDFS terminology
HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on the hosts. With the default replication value of 3, data is stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other to rebalance data, move copies around, and keep the replication of data high.
HDFS is not fully POSIX-compliant, because the requirements for a POSIX file system differ from the target goals of a Hadoop application. The trade-off of not having a fully POSIX-compliant file system is increased performance for data throughput and support for non-POSIX operations such as Append.
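To make the MapReduce model concrete, here is a minimal word-count sketch written for Hadoop Streaming, which lets any executable act as the mapper or reducer. This is a sketch under the usual Streaming conventions (the mapper emits tab-separated key-value pairs, and the reducer receives them sorted by key); the two parts would be saved as separate files:

    # mapper.py - emit (word, 1) for every word on standard input.
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py - sum the counts; input arrives sorted by key, so all
    # counts for a given word are contiguous.
    import sys
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")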
5) Apache Spark
Apache Spark is an open-source cluster computing framework for real-time processing. It has a thriving open-source community and is one of the most active Apache projects. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It builds on the MapReduce model and extends it to efficiently support more types of computations.
Let us look at its features in detail:
Polyglot: Spark provides high-level APIs in Java, Scala, Python and R, and Spark code can be written in any of these four languages. It also provides shells in Scala and Python: the Scala shell can be accessed through ./bin/spark-shell and the Python shell through ./bin/pyspark from the installed directory.
Speed: Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data processing. Spark is able to achieve this speed through controlled partitioning: it manages data using partitions that help parallelise distributed data processing with minimal network traffic.
Multiple Formats: Spark supports multiple data sources such as Parquet, JSON, Hive, and Cassandra, apart from the usual formats such as text files, CSV, and RDBMS tables. The Data Source API provides a pluggable mechanism for accessing structured data through Spark SQL; data sources can be more than just simple pipes that convert data and pull it into Spark.
Lazy Evaluation: Apache Spark delays its evaluation until it is absolutely necessary. This is one of the key factors contributing to its speed. Spark adds transformations to a DAG (Directed Acyclic Graph) of computation, and only when the driver requests some data does this DAG actually get executed.
Real-Time Computation: Spark's computation is real-time and has low latency because of its in-memory computation. Spark is designed for massive scalability; the Spark team has documented users running production clusters with thousands of nodes, and it supports several computational models.
Hadoop Integration: Apache Spark provides smooth compatibility with Hadoop. This is a boon for all the Big Data engineers who started their careers with Hadoop. Spark is a potential replacement for the MapReduce functions of Hadoop, while it also has the ability to run on top of an existing Hadoop cluster, using YARN for resource scheduling.
Machine Learning: Spark's MLlib is the machine learning component, which is handy when it comes to big data processing. It eradicates the need to use multiple tools, one for processing and one for machine learning. Spark provides data engineers and data scientists with a powerful, unified engine that is both fast and easy to use.
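A minimal PySpark sketch (assuming a local Spark installation) showing lazy evaluation: the filter is only a transformation, and nothing executes until the count() action is called.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("demo").getOrCreate()
    df = spark.createDataFrame([("pen", 10), ("book", 3)], ["product", "units"])
    big_orders = df.filter(df.units > 5)   # transformation: added to the DAG
    print(big_orders.count())              # action: triggers execution
    spark.stop()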
6) TensorFlow
TensorFlow is a library for creating deep learning models. It can be used, for example, to count people in videos, to create a chatbot for a company website, to recognise handwritten text in images, and to build recommendation systems for articles.
To learn TensorFlow, you should understand the meaning of a tensor and how it works. A tensor is a mathematical object represented as an array of (possibly) higher dimensions; such arrays of data differ in rank and dimensions, and these arrays are fed as input to the neural network.
TensorFlow was developed initially to run large sets of numerical computations. It uses a data flow graph to process data and perform computations. TensorFlow operates on the basis of a computational graph: the graph consists of nodes and edges, where each node performs a mathematical operation such as addition, subtraction or multiplication, while the edges carry the data.
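A minimal sketch (assuming TensorFlow 2.x) in which two tensors are created and multiplied; the operations are the nodes of the graph, and the tensors flow along its edges:

    import tensorflow as tf

    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])   # a rank-2 tensor (a matrix)
    b = tf.constant([[1.0], [2.0]])             # a 2 x 1 tensor
    c = tf.matmul(a, b)                         # node: matrix multiplication
    print(c.numpy())                            # [[ 5.] [11.]]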
4.2 APPLICATIONS & CASE STUDY OF DATA SCIENCE
Let us take a look at a Data Science case study in the retail domain. To understand the case, it is imperative to understand the retail supply chain; the whole supply chain flow is summarised in the figure below. Now let us look at the case study in depth.
Goal
A Fortune 500 retail company approaches you with a problem of demand forecasting and inventory management. The demand forecasts that they are receiving as an input are not accurate, and as a result there are problems with inventory as well. They want you to build a customised demand forecasting and inventory planning solution which will reduce out-of-stocks at their stores and put inventories in the right places.
[Figure: the retail supply chain flow]
Approach taken:
The data was gathered from the respective databases using SQL queries, and database decisions were taken. As this was structured data, we needed SQL databases to store it.
The data is at a store-product-day level and goes back to the year 1990, so we would require a Hadoop cluster for processing and for storing the results.
All forecasting in retail depends on a degree of aggregation. The aggregation could be over product units, locations, time buckets or promotions, according to the objective of the forecasting activity. We need the forecast at the day-store-product level, so we will be forecasting at that level.
Retail sales data often exhibit strong trend, seasonal variations, serial correlation and regime shifts, because any long span of data may include economic growth, inflation, and unexpected events.
Time series models have provided a solution for capturing these stylised characteristics, and have therefore long been applied to market-level aggregate retail sales forecasting.
Simple exponential smoothing and its extensions that include trend and seasonality (Holt-Winters), along with ARIMA models, have been the time series models most frequently employed for market-level sales forecasting. Even in the earliest references, reflecting controversies in the macroeconomic literature, researchers raised the question of which of the various time series models performed best and how they compared with simple econometric models.
So for this case study, we will be trying different time series models; one such model is sketched below.
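As an illustration, here is a minimal sketch of one such model, Holt-Winters exponential smoothing, using the statsmodels library; the monthly sales series below is made up purely to keep the example self-contained:

    import pandas as pd
    from statsmodels.tsa.holtwinters import ExponentialSmoothing

    # A made-up monthly series with a clear seasonal pattern.
    sales = pd.Series(
        [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118] * 3,
        index=pd.date_range("2015-01-01", periods=36, freq="MS"),
    )
    model = ExponentialSmoothing(
        sales, trend="add", seasonal="add", seasonal_periods=12
    ).fit()
    print(model.forecast(6))   # forecast the next six months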
The steps for this case study are:
1) Analyse whether the data has any missing values. If yes, should they be replaced by mean values?
2) Are there any NaN's in the data? If yes, should they be replaced by mean values?
3) Check the data types of the variables.
4) Do EDA on the data:
Try to see if the data is showing trends across time.
Check whether any kind of seasonality exists in the dataset.
Are particular products showing jumps in sales during only certain periods? Is cannibalisation of products happening?
Which are the best and worst selling products?
Which stores are best and worst with respect to sales?
How many stores are above and below the average sales number?
Are all the products showing similar trends in sales?
Can we group certain products by their sales trend?
5) Run a forecast on the data with statistical models and choose the best model across products (after grouping products into clusters which show similar sales trends).
6) Evaluate the results on the past data at hand.
7) Present the results and measure the impact.
8) The outcome - improved productivity and collaboration within the client:
An 8% improvement in product availability during the seasonal sales, leading to increased revenue to the tune of £1M annually.
A stable and repeatable process for 6,000-8,000 SKUs is currently being used across the online channel.
Forecasts served as a benchmark to measure the lift from promotional activities.
The new process also ensured closer collaboration between various teams within the client organisation, such as allocation & replenishment, trading, merchandising, and operations development.
The above impact is shown simply as an example of how the case study output should look.
We could also include facts such as: the forecast accuracy increased by about 15%, which led to greater sales of products by about 5%. This kind of direct impact on the business audience in turn lets you uncover and solve more and more problems in the business.
Check your Progress 1
Answer the questions.
1. What are the components of the Data Scientist's toolbox?
2. What does open source mean?
3. What are the big data technologies a data scientist should know?

Activity 1
Find and list the commands used in the MySQL database.
Summary
●● Data Science is a combination of multiple fields which involves creation, preparation, transformation, modelling, and visualisation of data.
●● The Data Science pipeline consists of Data Wrangling, Data Cleansing & Extraction, EDA, Statistical Model Building, and Data Visualisation.
●● The Data Scientist's toolbox consists of a wide variety of tools, ranging from data preparation to data visualisation.
●● Any data science problem should have a solution which can depict the business impact the solution might have.
●● If the impact can be shown in terms of revenue, so much the better.
Keywords
●● Data Science Pipeline: The 7 major stages of solving a Data Science problem.
●● Data Wrangling: The art of transforming the data into a format from which it is easier to draw insights.
●● Data Cleansing: The process of cleaning the data of missing values, garbage, NaN's, and outliers.
●● Data Visualisation: The art of building graphs and charts so as to understand data easily and find insights in it.
●● Statistical Modelling: The implementation of statistical equations on existing data.
●● Data Scientist Toolbox: The list of technologies needed by a Data Scientist to solve a Data Science problem.
Self-Assessment Questions
1. Write a short note on:
a. Hadoop
b. Python
2. Explain the features of SQL.
3. What are the applications of the Data Scientist's toolbox?
Answers to Check your Progress
Check your Progress 1
1) Components of the Data Scientist's toolbox are:
a. R & Python
b. Hadoop & Spark
c. SQL & NoSQL databases
d. Power BI, Tableau and other visualisation tools
e. Statistics
2) Open source means that the software is available freely and can be distributed, with changes made to the source code as per one's requirements.
3) The big data technologies a Data Scientist should know are Hadoop and Spark.
Suggested Reading
1. Jeffrey Stanton, An Introduction to Data Science.
2. Field Cady, The Data Science Handbook.
3. Frank Kane, Hands-On Data Science and Python Machine Learning.
4. Data Science in Practice.
Basics of Machine Learning
UNIT
5
Structure:
5.1 Introduction
5.2 Basic Concept of Machine Learning
5.3 Classes of Machine Learning Algorithms
5.4 Deep Learning
5.5 Why use R or Python for Machine Learning?
Summary
Key Words
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
Objectives
After going through this unit, you will be able to:
• Understand the concept of machine learning
• Explain the difference between supervised and unsupervised machine learning
• Describe classification, regression, dimension reduction, and clustering
5.1 INTRODUCTION
You interact with machine learning on a daily basis, whether you recognize it or not. The advertisements you see online are of products you're more likely to buy, based on the things you've previously bought or looked at. Faces in the photos you upload to social media platforms are automatically identified and tagged. Your car's GPS predicts which routes will be busiest at certain times of day and re-plots your route to minimize journey length. Your email client progressively learns which emails you want and which ones you consider spam, to make your inbox less cluttered, and your home personal assistant recognizes your voice and responds to your requests. From small improvements to our daily lives such as these, to big, society-changing ideas such as self-driving cars, robotic surgery, and the automated scanning for other Earth-like planets, machine learning has become an increasingly important part of modern life.
Machine learning isn't just the domain of large tech companies or computer scientists. Anyone with basic programming skills can implement machine learning in their work. If you're a scientist, machine learning can give you extraordinary insights into the phenomena you're studying. If you're a journalist, it can help you understand patterns in your data that can delineate your story. If you're a business person, machine learning can help you target the right customers and predict which products will sell best. If you're someone with a question or problem, and you have sufficient data to answer it, machine learning can help you do just that.
In this unit we're going to define what we actually mean by the term machine learning. You'll learn the difference between an algorithm and a model, and get a first look at machine learning techniques.
5.2 BASIC CONCEPT OF MACHINE LEARNING
Imagine you work as a researcher in a hospital. What if, when a new patient is checked in, you could calculate the risk of them dying? This would allow clinicians to treat high-risk patients more aggressively and result in more lives being saved. But where would you start? What data would you use? How would you get this information from the data? The answer is to use machine learning.
Machine learning, sometimes referred to as statistical learning, is a subfield of artificial intelligence (AI) whereby algorithms "learn" patterns in data in order to perform specific tasks. Although algorithms may sound complicated, they aren't. In fact, the idea behind an algorithm is not complicated at all: an algorithm is simply a step-by-step process that we use to achieve something, with a beginning and an end. Chefs have a different word for algorithms; they call them "recipes". At each stage in a recipe, you perform some kind of process, like beating an egg, then follow the next instruction in the recipe, such as mixing the ingredients.
So, having gathered data on your patients, you train a machine learning algorithm to learn patterns in the data associated with their survival. Now, when you gather data on a new patient, the algorithm can estimate the risk of that patient dying.
As another example, imagine you work for a power company, and it's your job to make sure customers' bills are estimated accurately. You train an algorithm to learn patterns in the data associated with the electricity use of households. Now, when a new household joins the power company, you can estimate how much money you should bill them each month.
Finally, imagine you're a political scientist, and you're looking for types of voters that no one (including you) knows about. You train an algorithm to identify patterns of voters in survey data, to better understand what motivates voters for a particular political party. Do you see any similarities between these problems and the problems you would like to solve? Then, provided the solution is hidden somewhere in your data, you can train a machine learning algorithm to extract it for you.
5.2.1 Artificial Intelligence and Machine Learning
Arthur Samuel, a scientist at IBM, first used the term machine learning in 1959. He used it to describe a form of artificial intelligence (AI) that involved training an algorithm to learn to play the game of checkers. The word learning is what's important here, as this is what distinguishes machine learning approaches from traditional AI.
Traditional AI is programmatic. In other words, you give the computer a set of rules so that when it encounters new data, it knows precisely which output to give. The problem with this approach is that you need to know all the possible outputs the computer should give you in advance, and the system will never give you an output that you haven't told it to give. Contrast this with the machine learning approach, where instead of telling the computer the rules, you give it the data and allow it to learn the rules for itself. The advantage of this approach is that the "machine" can learn patterns we didn't even know existed in the data, and the more data we provide, the better it gets at learning those patterns.
Notes
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
---------------------- Fig. 1.1: Traditional AI vs. Machine Learning
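To make this contrast concrete, here is a minimal Python sketch; the temperature data, the 38.0 threshold and the labels are illustrative assumptions, not part of the text above. The traditional approach hard-codes the rule, while the machine learning approach learns a comparable rule from labelled examples.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Traditional AI: the programmer supplies the rule explicitly.
    def rule_based(temperature):
        return "fever" if temperature >= 38.0 else "normal"  # hand-written rule

    # Machine learning: the computer is given data and learns the rule itself.
    temps = np.array([[36.5], [37.0], [37.2], [38.3], [39.0], [39.5]])
    labels = np.array(["normal", "normal", "normal", "fever", "fever", "fever"])
    model = LogisticRegression().fit(temps, labels)  # learns the boundary from data

    print(rule_based(38.5))            # output dictated by the programmer
    print(model.predict([[38.5]])[0])  # output learned from the examples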
5.2.2 The Difference between a Model and an Algorithm
In practice, we call the set of rules a machine learning algorithm learns a model. Once the model has been learned, we can give it new observations and it will output its predictions for the new data. We refer to these as models because they represent real-world phenomena in a simple-enough way that both we and the computer can interpret and understand them. Just as a model of the Eiffel Tower may be a good representation of the real thing but isn’t exactly the same, so statistical models are attempted representations of real-world phenomena but won’t match them perfectly.
The process by which the model is learned is referred to as the algorithm. As we discovered earlier, an algorithm is just a sequence of operations that work together to solve a problem. So how does this work in practice? Let’s take a simple example. Say we have two continuous variables, and we would like to train an algorithm that can predict one (the outcome or dependent variable), given the other (the predictor or independent variable). The relationship between these variables can be described by a straight line, which needs only two parameters to describe it: its slope, and where it crosses the y axis (the y intercept). This is shown in figure 5.2.
----------------------
70 Introduction to Data Science, Machine Learning & AI
Notes
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
----------------------
Fig. 5.2: Any straight line can be described by its slope
(the change in y divided by the change in x), and its intercept ----------------------
(where it crosses the y axis when x = 0). The equation y = intercept + slope* ----------------------
x can be used to predict the value of y, given a value of x.
An algorithm to learn this relationship could look something like the example in figure 5.3. We start by fitting a line with no slope through the mean of all the data. We calculate the distance each data point is from the line, square it, and sum these squared values. This sum of squares is a measure of how closely the line fits the data. Next, we rotate the line a little in a clockwise direction and measure the sum of squares for this line. If the sum of squares is bigger than it was before, we’ve made the fit worse, so we rotate the slope in the other direction and try again. If the sum of squares gets smaller, then we’ve made the fit better. We continue with this process, rotating the slope a little less each time we get closer, until the improvement on our previous iteration is smaller than some pre-set value we’ve chosen. The algorithm has iteratively learned the model (the slope and y intercept) needed to predict future values of the output variable, given only the predictor variable. This example is slightly crude, but hopefully it illustrates how such an algorithm could work.
Fig. 5.3: A hypothetical algorithm for learning the parameters of a straight line. This algorithm takes two continuous variables as inputs, and fits a straight line through the mean. It iteratively rotates the line until it finds a solution that minimises the sum of squares. The parameters of the line are output as the learned model.
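The rotate-and-check procedure of figure 5.3 could be sketched in Python along the following lines; the simulated data, the step size and the stopping value are illustrative assumptions.

    import numpy as np

    def fit_line(x, y, step=0.1, tolerance=1e-6):
        slope = 0.0                        # start with a line with no slope...
        intercept = y.mean()               # ...through the mean of the data
        best = ((y - (intercept + slope * x)) ** 2).sum()  # sum of squares
        while step > tolerance:
            improved = False
            for direction in (1, -1):      # rotate one way, then the other
                s = slope + direction * step
                i = y.mean() - s * x.mean()        # keep the line through the mean
                ss = ((y - (i + s * x)) ** 2).sum()
                if ss < best:              # a smaller sum of squares = a better fit
                    slope, intercept, best = s, i, ss
                    improved = True
                    break
            if not improved:
                step /= 2                  # rotate a little less each time
        return slope, intercept            # the learned model

    rng = np.random.default_rng(42)
    x = rng.uniform(0, 10, 50)
    y = 2.0 * x + 1.0 + rng.normal(0, 1, 50)
    print(fit_line(x, y))                  # roughly (2.0, 1.0)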
While certain algorithms tend to perform better than others with certain types of data, no single algorithm will always outperform all others on all problems. This concept is called the no free lunch theorem. In other words, you don’t get something for nothing; you need to put some effort into working out the best algorithm for your particular problem. Instead, data scientists typically choose a few algorithms they know tend to work well for the type of data and problem they are working on, and see which algorithm generates the best-performing model.
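In Python, for instance, comparing a few candidate algorithms with cross-validation might look like the following sketch; scikit-learn and its bundled iris data are assumed purely for illustration.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    candidates = {
        "logistic regression": LogisticRegression(max_iter=1000),
        "k-nearest neighbours": KNeighborsClassifier(),
        "decision tree": DecisionTreeClassifier(random_state=0),
    }
    for name, model in candidates.items():
        scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
        print(name, scores.mean())                   # keep the best performer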
5.3 CLASSES OF MACHINE LEARNING ALGORITHMS
All machine learning algorithms can be categorized by their learning type, and by the task they perform. There are three learning types:
●● Supervised learning
●● Unsupervised learning
●● Reinforcement learning
This categorization depends on how the algorithms learn. Do they require us to hold their hand through the learning process? Or do they learn the answers for themselves? Supervised and unsupervised algorithms can each be further split into two classes:
●● Supervised: classification and regression
●● Unsupervised: dimension reduction and clustering
This categorization depends on what they learn to do.
So we categorize algorithms by how they learn, and what they learn to do. But why do we care about this? Well, there are a lot of machine learning algorithms available to us. How do we know which one to pick? What kind of data do they require to function properly? Knowing which categories different algorithms belong to makes our job of selecting the most appropriate ones much simpler.
5.3.1 Differences between Supervised, Unsupervised, and Semi-Supervised Learning
Imagine you are trying to get a toddler to learn about shapes using blocks of wood. In front of them, they have a ball, a cube, and a star. You ask them to show you the cube; if they point to the correct shape you tell them they are correct, and if they are incorrect you also tell them. You repeat this procedure until the toddler can identify the correct shape almost all of the time. This is called supervised learning, because you, the person who already knows which shape is which, are supervising the learner by telling them the answers.
Now imagine the toddler is given multiple balls, cubes, and stars, but this time is also given three bags. The toddler has to put all the balls in one bag, the cubes in another, and the stars in another, but you won’t tell them if they’re correct; they have to work it out for themselves from nothing but the information they have in front of them. This is called unsupervised learning, because the learner has to identify patterns themselves with no outside help.
So a machine learning algorithm is said to be supervised if it uses a ground truth, or in other words, labeled data. For example, if we wanted to classify a patient biopsy as healthy or cancerous based on its gene expression, we would give an algorithm the gene expression data, labeled with whether that tissue was healthy or cancerous. The algorithm now knows which cases come from each of the two types, and tries to learn patterns in the data that discriminate them.
Another example: if we were trying to estimate how much someone spends on their credit card in a given month, we would give an algorithm information about them, such as their income, family size, and whether they own their home, labeled with how much they spent on their credit card. The algorithm now knows how much each of the cases spent, and looks for patterns in the data that can predict these values in a reproducible way.
A machine learning algorithm is said to be unsupervised if it does not use a ground truth, and instead looks for patterns in the data on its own that hint at some underlying structure. For example, let’s say we take the gene expression data from lots of cancerous biopsies, and ask an algorithm to tell us if there are clusters of biopsies. A cluster is a group of data points which are similar to each other, but different from data points in other clusters. This type of analysis can tell us if we have subgroups of cancer types which we may need to treat differently.
Alternatively, we may have a dataset with a large number of variables, so many that it is difficult to interpret and look for relationships manually. We can ask an algorithm to look for a way of representing this high-dimensional dataset in a lower-dimensional one, while maintaining as much information from the original data as possible. Take a look at the summary in figure 5.4. If your algorithm uses labeled data (i.e. a ground truth), then it is supervised, and if it does not use labeled data then it is unsupervised.
Fig. 5.4: Supervised vs. Unsupervised Machine Learning
Semi-supervised learning
Most machine learning algorithms will fall into one of these categories, but there is an additional approach called semi-supervised learning. As its name suggests, semi-supervised machine learning is not quite supervised and not quite unsupervised.
Semi-supervised learning often describes a machine learning approach that combines supervised and unsupervised algorithms, rather than strictly defining a class of algorithms in itself. The premise of semi-supervised learning is that, often, labeling a dataset may require a large amount of manual work by an expert observer. This process may be very time-consuming, expensive and error-prone, and may be impossible for an entire dataset. So instead, we expertly label as many of the cases as is feasibly possible, then build a supervised model using only these labeled data. We pass the rest of our data (the unlabeled cases) into the model to get their predicted labels, which are called pseudo-labels because we don’t know whether all of them are actually correct. Now, we combine the data with the manual labels and pseudo-labels, and use this to train a new model.
This approach allows us to train a model that learns from both labeled and unlabeled data, and can improve overall predictive performance because we are able to use all of the data at our disposal.
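A minimal sketch of this pseudo-labeling workflow, assuming synthetic data and scikit-learn (the text itself names no particular library):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, random_state=0)
    labeled = np.zeros(500, dtype=bool)
    labeled[:50] = True                    # only 50 cases were expertly labeled

    # 1. Build a supervised model using only the labeled cases.
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])

    # 2. Pass the unlabeled cases into the model to get pseudo-labels.
    pseudo = model.predict(X[~labeled])

    # 3. Combine manual labels and pseudo-labels, and train a new model.
    X_all = np.concatenate([X[labeled], X[~labeled]])
    y_all = np.concatenate([y[labeled], pseudo])
    final_model = LogisticRegression(max_iter=1000).fit(X_all, y_all)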
Within the categories of supervised and unsupervised, machine learning algorithms can be further categorized by the tasks they perform. Just as a mechanical engineer knows which tools to use for the task at hand, so the data scientist needs to know which algorithms they should use for their task. There are four main classes to choose from: classification, regression, dimension reduction, and clustering.
5.3.2 Classification, Regression, Dimension Reduction, and Clustering
Supervised machine learning algorithms can be split into two classes: classification algorithms and regression algorithms.
Classification algorithms take labeled data (because they are supervised learning methods) and learn patterns in the data that can be used to predict a categorical output variable. This is most often a grouping variable (a variable specifying which group a particular case belongs to) and can be binomial (two groups) or multinomial (more than two groups). Classification problems are very common machine learning tasks. Which customers will default on their payments? Which patients will survive? Which objects in a telescope image are stars, planets or galaxies? When faced with problems like these, you should use a classification algorithm.
Regression algorithms take labeled data and learn patterns in the data that can be used to predict a continuous output variable. How much carbon dioxide does a household contribute to the atmosphere? What will the share price of a company be tomorrow? What is the concentration of insulin in a patient’s blood? When faced with problems like these, you should use a regression algorithm.
Unsupervised machine learning algorithms can also be split into two classes: dimension reduction algorithms and clustering algorithms.
Dimension reduction algorithms take unlabeled (because they are unsupervised learning methods), high-dimensional data (data with many variables) and learn a way of representing it in a lower number of dimensions. Dimension reduction techniques may be used as an exploratory technique (because it’s very difficult for humans to visually interpret data in more than two or three dimensions at once), or as a pre-processing step in our machine learning pipeline (it can help mitigate problems such as collinearity and the curse of dimensionality, terms we will define in later chapters). We can also use it to help us visually confirm the performance of classification and clustering algorithms (by allowing us to plot the data in two or three dimensions).
Clustering algorithms take unlabeled data and learn patterns of clustering in the data. A cluster is a collection of observations which are more similar to each other than to data points in other clusters. We assume that observations in the same cluster share some unifying features or identity that makes them identifiably different from other clusters. Clustering algorithms may be used as an exploratory technique to understand the structure of our data, and may indicate a grouping structure that can be fed into classification algorithms. Are there subtypes of patient responders in a clinical trial? How many classes of respondent were there in the survey? Are there different types of customer that use our company? When faced with problems like these, you should use a clustering algorithm.
By separating machine learning algorithms into these four classes, you will find it easier to select appropriate ones for the tasks at hand. Deciding which class of algorithm to choose from is usually straightforward (a brief sketch in code follows this list):
If you need to predict a categorical variable, use a classification algorithm.
If you need to predict a continuous variable, use a regression algorithm.
If you need to represent the information of many variables with fewer variables, use dimension reduction.
If you need to identify clusters of cases, use a clustering algorithm.
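As a minimal illustration of the four classes (scikit-learn and its bundled iris data are assumed as an example), one representative algorithm from each class might be used like this:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression, LinearRegression
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    X, y = load_iris(return_X_y=True)

    # Classification: predict a categorical variable (the species).
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    # Regression: predict a continuous variable (petal width from the rest).
    reg = LinearRegression().fit(X[:, :3], X[:, 3])

    # Dimension reduction: represent four variables with two.
    X_2d = PCA(n_components=2).fit_transform(X)

    # Clustering: identify clusters of cases without using the labels.
    clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)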
5.4 DEEP LEARNING
Deep learning is a subfield of machine learning (all deep learning is machine learning, but not all machine learning is deep learning) that has become extremely popular in the last 5 to 10 years for two main reasons:
●● It can produce models with outstanding performance
●● We now have the computational power to apply it more broadly
Deep learning uses neural networks to learn patterns in data, a term referring to the way in which the structure of these models superficially resembles neurons in the brain, with connections allowing them to pass information between them. The relationship between AI, machine learning, and deep learning is summarized in figure 5.5.
Fig. 5.5: The relationship between Artificial Intelligence (AI), Machine Learning, and Deep Learning
While it’s true that deep learning methods will typically outperform “shallow” learning methods (a term sometimes used for machine learning methods that are not deep learning) on the same dataset, they are not always the best choice for a given problem. Deep learning methods are often not the most appropriate choice, for three reasons:
They are computationally expensive. By expensive, we don’t mean monetary cost, of course; we mean they require a lot of computing power, which means they can take a long time (hours or even days!) to train. Arguably this is a less important reason not to use deep learning, because if a task is important enough to you, you can invest the time and computational resources required to solve it. But if you can train a model in a few minutes that performs well, then why waste additional time and resources?
They tend to require more data. Deep learning models typically require hundreds to thousands of cases in order to perform extremely well. This largely depends on the complexity of the problem at hand, but shallow methods tend to perform better on small datasets than their deep learning counterparts.
The rules are less interpretable. By their nature, deep learning models favor performance over model interpretability. Arguably, our focus should be on performance, but often we’re not only interested in getting the right output; we’re also interested in the rules the algorithm learned, because these help us to interpret things about the real world and may help us further our research. The rules learned by a neural network are not easy to interpret.
So while deep learning methods can be extraordinarily powerful, shallow learning techniques are still invaluable tools in the arsenal of data scientists.
Deep learning algorithms are particularly good at tasks involving complex data, such as image classification and audio transcription.
5.5 WHY USE R OR PYTHON FOR MACHINE LEARNING?
There is something of a rivalry between the two most commonly used data science languages: R and Python. Anyone who is new to machine learning will choose one or the other to get started, and their decision will often be guided by the learning resources they have access to, which one is more commonly used in their field of work, and which one their colleagues use. There are no machine learning tasks which are only possible in one language or the other, although some of the more cutting-edge deep learning approaches are easier to apply in Python (they tend to be written in Python first and implemented in R later). Python, while very good for data science, is a more general-purpose programming language, whereas R is geared specifically towards mathematical and statistical applications. This means that users of R can focus purely on data, but may feel restricted if they ever need to build applications based on their models. There are modern tools in R designed specifically to make data science tasks simple and human-readable, such as those from the tidyverse.
Traditionally, machine learning algorithms in R were scattered across multiple packages, written by different authors. This meant you would need to learn to use new functions with different arguments and implementations each time you wanted to apply a new algorithm. Proponents of Python could use this as an example of why it was better suited for machine learning, as it has the well-known scikit-learn package, which has a plethora of machine learning algorithms built into it. But R has now followed suit, with the caret and mlr packages.
The mlr package (which stands for machine learning in R) provides an interface for a large number of machine learning algorithms, and allows you to perform extremely complicated machine learning tasks with very little coding.
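The closest Python analogue is scikit-learn’s uniform estimator interface: every algorithm is trained with fit() and used with predict(), so trying a different algorithm requires very little new code. A small sketch, again assuming the bundled iris data:

    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # The same two calls work for any classifier in the library.
    for Model in (KNeighborsClassifier, DecisionTreeClassifier):
        model = Model().fit(X, y)                    # train
        print(Model.__name__, model.predict(X[:3]))  # predict for new observations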
Check your Progress 1
Fill in the blanks.
1. A ______ is just a sequence of operations that work together to solve a problem.
2. A machine learning algorithm is said to be ______ if it uses a ground truth, or in other words, labeled data.
3. A ______ is a collection of observations which are more similar to each other than to data points in other clusters.
4. Deep learning uses ______ to learn patterns in data, a term referring to the way in which the structure of these models superficially resembles neurons in the brain, with connections allowing them to pass information between them.
Activity 1
1. List the applications of deep learning.
2. Search for and list the algorithms used in Artificial Intelligence techniques.
Summary
●● Artificial intelligence is the appearance of intelligent behaviour by a computer process.
●● Machine learning is a subfield of artificial intelligence, where the computer learns relationships in data to make predictions on future, unseen data, or to identify meaningful patterns that help us understand our data better.
●● A machine learning algorithm is the process by which patterns and rules in the data are learned, and the model is the collection of those patterns and rules, which accepts new data, applies the rules to it, and outputs an answer.
●● Deep learning is a subfield of machine learning, which is, itself, a subfield of artificial intelligence.
●● Machine learning algorithms are categorized as supervised or unsupervised, depending on whether they learn from ground-truth-labeled data (supervised learning) or unlabeled data (unsupervised learning).
●● Supervised learning algorithms are categorized as classification (if they predict a categorical variable) or regression (if they predict a continuous variable).
●● Unsupervised learning algorithms are categorized as dimension reduction (if they find a lower-dimensional representation of the data) or clustering (if they identify clusters of cases in the data).
●● Along with Python, R is a popular data science language and contains many tools and built-in datasets that simplify the process of learning data science and machine learning.
Keywords
●● Machine Learning: It is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
●● Artificial Intelligence: It is the simulation of human intelligence processes by machines, especially computer systems.
●● Algorithm: It is a finite sequence of well-defined, computer-implementable instructions, typically used to solve a class of problems or to perform a computation.
●● Deep Learning: A class of machine learning algorithms that uses multiple layers to progressively extract higher-level features from the raw input.
Self-Assessment Questions
1. State the difference between Supervised and Unsupervised learning.
2. Explain the concept of Deep Learning with an example.
3. Write a short note on Artificial Intelligence.
Answers to Check your Progress
Check your Progress 1
Fill in the blanks.
1. An algorithm is just a sequence of operations that work together to solve a problem.
2. A machine learning algorithm is said to be supervised if it uses a ground truth, or in other words, labeled data.
3. A cluster is a collection of observations which are more similar to each other than to data points in other clusters.
4. Deep learning uses neural networks to learn patterns in data, a term referring to the way in which the structure of these models superficially resembles neurons in the brain, with connections allowing them to pass information between them.
Suggested Reading
1. Machine Learning with R, the Tidyverse, and mlr by Hefin Ioan Rhys
2. Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville
3. Elements of Machine Learning by Pat Langley
4. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools and Techniques to Build Intelligent Systems by Aurélien Géron
5. Introduction to Machine Learning by Ethem Alpaydin
The text is adapted by Symbiosis Centre for Distance Learning under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license, as requested by the work’s creator or licensees. This license is available at https://creativecommons.org/licenses/by-sa/4.0/.
Supervised Machine Learning
UNIT
6
Structure:
6.1 Introduction
6.2 Supervised Learning
6.3 Algorithm Types
6.3.1 K-Nearest-Neighbours (KNN) Algorithm
6.3.2 Naïve Bayes Classifier
6.3.3 Decision Tree
6.3.4 Support Vector Machine
Summary
Key Words
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
Objectives
After going through this unit, you will be able to:
• Understand the concept of supervised machine learning algorithms
• Explain the different supervised machine learning algorithms
6.1 INTRODUCTION
Machine learning algorithms are organized into a taxonomy based on the desired outcome of the algorithm. Common algorithm types include:
●● Supervised learning – where the algorithm generates a function that maps inputs to desired outputs. One standard formulation of the supervised learning task is the classification problem: the learner is required to learn (to approximate the behaviour of) a function which maps a vector into one of several classes by looking at several input-output examples of the function.
●● Unsupervised learning – which models a set of inputs; labeled examples are not available.
●● Semi-supervised learning – which combines both labeled and unlabeled examples to generate an appropriate function or classifier.
●● Reinforcement learning – where the algorithm learns a policy of how to act given an observation of the world. Every action has some impact on the environment, and the environment provides feedback that guides the learning algorithm.
●● Transduction – similar to supervised learning, but does not explicitly construct a function; instead, it tries to predict new outputs based on training inputs, training outputs, and new inputs.
●● Learning to learn – where the algorithm learns its own inductive bias based on previous experience.
The performance and computational analysis of machine learning algorithms is a branch of statistics known as computational learning theory. Machine learning is about designing algorithms that allow a computer to learn. Learning does not necessarily involve consciousness; rather, learning is a matter of finding statistical regularities or other patterns in the data. Thus, many machine learning algorithms will barely resemble how a human might approach a learning task. However, learning algorithms can give insight into the relative difficulty of learning in different environments.
6.2 SUPERVISED LEARNING
Supervised learning is fairly common in classification problems because the goal is often to get the computer to learn a classification system that we have created. Digit recognition, once again, is a common example of classification learning. More generally, classification learning is appropriate for any problem where deducing a classification is useful and the classification is easy to determine. In some cases, it might not even be necessary to give predetermined classifications to every instance of a problem if the agent can work out the classifications for itself. This would be an example of unsupervised learning in a classification context.
Supervised learning often leaves the probability for the inputs undefined. This model is not needed as long as the inputs are available, but if some of the input values are missing, it is not possible to infer anything about the outputs. In unsupervised learning, all the observations are assumed to be caused by latent variables; that is, the observations are assumed to be at the end of the causal chain. Examples of supervised learning and unsupervised learning are shown in figure 6.1 below.
Fig. 6.1: Supervised and Unsupervised Learning
Supervised learning is the most common technique for training neural networks and decision trees. Both of these techniques are highly dependent on the information given by the pre-determined classifications. In the case of neural networks, the classification is used to determine the error of the network and then adjust the network to minimize it; in decision trees, the classifications are used to determine what attributes provide the most information that can be used to solve the classification puzzle.
We’ll look at both of these in more detail, but for now, it should be sufficient to know that both of these examples thrive on having some “supervision” in the form of pre-determined classifications. Inductive machine learning is the process of learning a set of rules from instances (examples in a training set), or more generally speaking, creating a classifier that can be used to generalize from new instances. The process of applying supervised ML to a real-world problem is described in figure 6.2.
The first step is collecting the dataset. If a requisite expert is available, then s/he could suggest which fields (attributes, features) are the most informative. If not, then the simplest method is that of “brute force”, which means measuring everything available in the hope that the right (informative, relevant) features can be isolated. However, a dataset collected by the “brute force” method is not directly suitable for induction. It contains in most cases noise and missing feature values, and therefore requires significant pre-processing according to Zhang et al (Zhang, 2002).
The second step is data preparation and data pre-processing. Depending on the circumstances, researchers have a number of methods to choose from to handle missing data, and surveys of contemporary techniques for outlier (noise) detection have identified those techniques’ advantages and disadvantages. Instance selection is not only used to handle noise but also to cope with the infeasibility of learning from very large datasets. Instance selection in these datasets is an optimization problem that attempts to maintain the mining quality while minimizing the sample size. It reduces data and enables a data mining algorithm to function and work effectively with very large datasets. There is a variety of procedures for sampling instances from a large dataset; see figure 6.2 below.
Feature subset selection is the process of identifying and removing as many irrelevant and redundant features as possible. This reduces the dimensionality of the data and enables data mining algorithms to operate faster and more effectively. The fact that many features depend on one another often unduly influences the accuracy of supervised ML classification models. This problem can be addressed by constructing new features from the basic feature set. This technique is called feature construction/transformation. These newly generated features may lead to the creation of more concise and accurate classifiers. In addition, the discovery of meaningful features contributes to better comprehensibility of the produced classifier, and a better understanding of the learned concept. Speech recognition using hidden Markov models and Bayesian networks relies on some elements of supervision as well, in order to adjust parameters to, as usual, minimize the error on the given inputs. Notice something important here: in the classification problem, the goal of the learning algorithm is to minimize the error with respect to the given inputs. These inputs, often called the “training set”, are the examples from which the agent tries to learn. But learning the training set well is not necessarily the best thing to do.
For instance, if I tried to teach you exclusive-or, but only showed you combinations consisting of one true and one false, and never both false or both true, you might learn the rule that the answer is always true. Similarly, with machine learning algorithms, a common problem is over-fitting the data: essentially memorizing the training set rather than learning a more general classification technique. As you might imagine, not all training sets have their inputs classified correctly. This can lead to problems if the algorithm used is powerful enough to memorize even the apparently “special cases” that don’t fit the more general principles. This, too, can lead to over-fitting, and it is a challenge to find algorithms that are both powerful enough to learn complex functions and robust enough to produce generalisable results.
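To see over-fitting concretely, here is a small sketch; the noisy synthetic data and the unrestricted decision tree are assumptions made purely for the illustration. The model memorizes the training set, including its “special cases”, and generalizes poorly.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 5))
    y = (X[:, 0] > 0).astype(int)
    y[rng.random(300) < 0.2] ^= 1        # flip 20% of labels: mislabelled "special cases"

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # unrestricted depth
    print(tree.score(X_tr, y_tr))        # ~1.0: the training set is memorized
    print(tree.score(X_te, y_te))        # much lower: poor generalization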
Fig. 6.2: The Supervised Machine Learning Process
6.3 ALGORITHM TYPES
In the area of supervised/unsupervised learning, the common algorithm types are as follows:
●● Linear Classifiers
   Logistic Regression
   Naïve Bayes Classifier
   Perceptron
   Support Vector Machine
●● Quadratic Classifiers
●● K-Means Clustering
●● Boosting
●● Decision Tree
   Random Forest
●● Neural Networks
●● Bayesian Networks
In this unit, we shall explain four machine learning techniques, with examples of how they perform in reality. These are:
●● K-Nearest Neighbours algorithm
●● Naïve Bayes Classifier
●● Decision Tree
●● Support Vector Machine
Linear Classifiers
In machine learning, the goal of classification is to group items that have similar feature values into groups. Timothy et al (Timothy Jason Shepard, 1998) stated that a linear classifier achieves this by making a classification decision based on the value of a linear combination of the features. If the input feature vector to the classifier is a real vector x, then the output score is

\[ y = f(\vec{w} \cdot \vec{x}) = f\Big( \sum_{j} w_j x_j \Big) \]

where w is a real vector of weights and f is a function that converts the dot product of the two vectors into the desired output. The weight vector w is learned from a set of labelled training samples. Often f is a simple function that maps all values above a certain threshold to the first class and all other values to the second class. A more complex f might give the probability that an item belongs to a certain class. For a two-class classification problem, one can visualize the operation of a linear classifier as splitting a high-dimensional input space with a hyperplane: all points on one side of the hyperplane are classified as “yes”, while the others are classified as “no”. A linear classifier is often used in situations where the speed of classification is an issue, since it is often the fastest classifier, especially when x is sparse. However, decision trees can be faster. Also, linear classifiers often work very well when the number of dimensions in x is large, as in document classification, where each element of x is typically the number of counts of a word in a document. In such cases, the classifier should be well regularized.
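A minimal numpy sketch of this scoring rule; the weight vector and threshold here are illustrative assumptions rather than values learned from data.

    import numpy as np

    w = np.array([0.8, -0.4, 0.3])   # weights, normally learned from labelled samples
    threshold = 0.0

    def classify(x):
        score = np.dot(w, x)         # the dot product of weights and features
        return "yes" if score > threshold else "no"  # f maps the score to a class

    print(classify(np.array([1.0, 0.5, -0.2])))      # which side of the hyperplane?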
6.3.1 K-Nearest-Neighbours (KNN) Algorithm
The KNN algorithm is a robust and versatile classifier that is often used as a benchmark for more complex classifiers such as Artificial Neural Networks (ANN) and Support Vector Machines (SVM). Despite its simplicity, KNN can outperform more powerful classifiers and is used in a variety of applications, such as economic forecasting, data compression and genetics.
Let’s first start by establishing some definitions and notations. We will use x to denote a feature (aka predictor, attribute) and y to denote the target (aka label, class) we are trying to predict.
KNN falls into the supervised learning family of algorithms. Informally, this means that we are given a labelled dataset consisting of training observations (x, y) and would like to capture the relationship between x and y. More formally, our goal is to learn a function h: X→Y so that, given an unseen observation x, h(x) can confidently predict the corresponding output y.
The KNN classifier is also a non-parametric and instance-based learning algorithm.
Non-parametric means it makes no explicit assumptions about the functional form of h, avoiding the danger of mis-modeling the underlying distribution of the data. For example, suppose our data is highly non-Gaussian but the learning model we choose assumes a Gaussian form. In that case, our algorithm would make extremely poor predictions.
Instance-based learning means that our algorithm doesn’t explicitly learn a model. Instead, it chooses to memorize the training instances, which are subsequently used as “knowledge” for the prediction phase. Concretely, this means that only when a query to our database is made (i.e. when we ask it to predict a label given an input) will the algorithm use the training instances to spit out an answer.
In the classification setting, the K-nearest neighbour algorithm essentially boils down to forming a majority vote between the K most similar instances to a given “unseen” observation. Similarity is defined according to a distance metric between two data points. A popular choice is the Euclidean distance, given by

\[ d(x, x') = \sqrt{(x_1 - x'_1)^2 + (x_2 - x'_2)^2 + \cdots + (x_n - x'_n)^2} \]

but other measures can be more suitable for a given setting; these include the Manhattan, Chebyshev and Hamming distances.
More formally, given a positive integer K, an unseen observation x and a similarity metric d, the KNN classifier performs the following two steps:
First, it runs through the whole dataset computing d between x and each training observation. We’ll call the K points in the training data that are closest to x the set A. Note that K is usually odd, to prevent tie situations.
Second, it estimates the conditional probability for each class, that is, the fraction of points in A with that given class label (here I(⋅) is the indicator function, which evaluates to 1 when its argument is true and 0 otherwise):

\[ P(y = j \mid X = x) = \frac{1}{K} \sum_{i \in A} I\big(y^{(i)} = j\big) \]

Finally, our input x gets assigned to the class with the largest probability.
At this point, you’re probably wondering how to pick the variable K and what its effects are on your classifier. Well, like most machine learning algorithms, the K in KNN is a hyper-parameter that you, as a designer, must pick in order to get the best possible fit for the data set. Intuitively, you can think of K as controlling the shape of the decision boundary we talked about earlier.
When K is small, we are restraining the region of a given prediction and forcing our classifier to be “more blind” to the overall distribution. A small value for K provides the most flexible fit, which will have low bias but high variance. Graphically, our decision boundary will be more jagged. On the other hand, a higher K averages more voters in each prediction and hence is more resilient to outliers. Larger values of K will have smoother decision boundaries, which means lower variance but increased bias.
For example, suppose a k-NN algorithm was given an input of data points of specific men’s and women’s weights and heights. To determine the gender of an unknown input (a new, unlabelled point), k-NN can look at its nearest k neighbours (suppose k = 3) and label the input with the majority gender among them. This method is a very simple and logical way of labelling unknown inputs, with a high rate of success.
K-nearest neighbours can be used in classification or regression machine learning tasks. Classification involves placing input points into appropriate categories, whereas regression involves establishing a relationship between input points and the rest of the data. In either of these cases, determining a neighbour can be performed using many different notions of distance, with the most common being the Euclidean and Hamming distances. Euclidean distance is the most popular notion of distance: the length of a straight line between two points. Hamming distance is the same concept applied to strings: the distance is the number of positions at which two strings differ. Furthermore, for certain multivariable tasks, distances must be normalized (or weighted) to accurately represent the correlation between variables and their strength of correlation.
The KNN Algorithm
1. Load the data.
2. Initialize K to your chosen number of neighbours.
3. For each example in the data:
   3.1 Calculate the distance between the query example and the current example from the data.
   3.2 Add the distance and the index of the example to an ordered collection.
4. Sort the ordered collection of distances and indices from smallest to largest (in ascending order) by the distances.
5. Pick the first K entries from the sorted collection.
6. Get the labels of the selected K entries.
7. If regression, return the mean of the K labels.
8. If classification, return the mode of the K labels.
(A Python sketch of these steps appears below.)
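In this minimal from-scratch sketch, the tiny height/weight dataset is an illustrative assumption:

    import math
    from collections import Counter

    def knn_predict(data, query, k, task="classification"):
        # Steps 3-3.2: distance from the query to every example, with its index.
        distances = []
        for index, (features, label) in enumerate(data):
            distances.append((math.dist(features, query), index))  # Euclidean
        # Step 4: sort by distance, in ascending order.
        distances.sort()
        # Steps 5-6: take the first K entries and collect their labels.
        k_labels = [data[index][1] for _, index in distances[:k]]
        # Step 7: regression returns the mean of the K labels.
        if task == "regression":
            return sum(k_labels) / k
        # Step 8: classification returns the mode (majority vote).
        return Counter(k_labels).most_common(1)[0][0]

    # Steps 1-2: load (here, hard-code) the data and choose K.
    data = [((170, 60), "female"), ((180, 80), "male"), ((165, 55), "female"),
            ((185, 90), "male"), ((175, 75), "male")]
    print(knn_predict(data, (178, 78), k=3))  # likely "male"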
For k-NN classification, an input is classified by a majority vote of its neighbours. That is, the algorithm obtains the class membership of its k neighbours and outputs the class that represents a majority of the k neighbours.
An example of k-NN classification
Suppose we are trying to classify the green circle. Let us begin with k = 3 (the solid line). In this case, the algorithm would return a red triangle, since it constitutes a majority of the 3 neighbours. Likewise, with k = 5 (the dotted line), the algorithm would return a blue square.
If no majority is reached with the k neighbours, many courses of action can be taken. For example, one could use a plurality system or even use a different algorithm to determine the membership of that data point.
K-NN regression works in a similar manner. The value returned is the average value of the input’s k neighbours.
Suppose we have data points sampled from a sine wave (with some variance, of course) and our task is to produce a y value for a given x value. When given an input data point, k-NN would return the average y value of the input’s k neighbours. For example, if k-NN were asked to return the corresponding y value for x = 0, the algorithm would find the k nearest points to x = 0 and return the average y value corresponding to these k points. This algorithm would be simple, but very successful for most x values.
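A sketch of this with scikit-learn’s KNeighborsRegressor; the noisy sine data is simulated, as an assumption.

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    rng = np.random.default_rng(1)
    x = rng.uniform(-3, 3, 200).reshape(-1, 1)       # 200 points along a sine wave
    y = np.sin(x).ravel() + rng.normal(0, 0.1, 200)  # with some variance, of course

    knn = KNeighborsRegressor(n_neighbors=5).fit(x, y)
    print(knn.predict([[0.0]]))  # the average y value of the 5 nearest points to x = 0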
Pros and Cons
k-NN is one of many algorithms used in machine learning tasks, in fields such as computer vision and gene expression analysis. So why use k-NN over other algorithms? The following is a list of the pros and cons k-NN has over the alternatives.
Pros:
Very easy to understand and implement. A k-NN implementation does not require much code and can be a quick and simple way to begin working with machine learning datasets.
Does not assume any probability distribution on the input data. This can come in handy for inputs where the probability distribution is unknown, and it therefore makes the method robust.
Can quickly respond to changes in input. k-NN employs lazy learning, which generalizes during testing; this allows it to adapt during real-time use.
Cons:
Sensitive to localized data. Since k-NN gets all of its information from the input’s neighbours, localized anomalies affect outcomes significantly, rather than being averaged out as they would be for an algorithm that uses a generalized view of the data.
Computation time. Lazy learning requires that most of k-NN’s computation be done during testing, rather than during training. This can be an issue for large datasets.
Normalization. If one type of category occurs much more often than another, classifying an input will be more biased towards that one category (since it is more likely to be a neighbour of the input). This can be mitigated by applying a lower weight to more common categories and a higher weight to less common categories; however, this can still cause errors near decision boundaries.
Dimensions. In the case of many dimensions, inputs can commonly be “close” to many data points. This reduces the effectiveness of k-NN, since the algorithm relies on a correlation between closeness and similarity. One workaround for this issue is dimension reduction, which reduces the number of working variable dimensions (but can lose variable trends in the process).
6.3.2 Naïve Bayes Classifier
A Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes’ theorem (from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be “independent feature model”.
In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4” in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to contribute independently to the probability that this fruit is an apple.
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without believing in Bayesian probability or using any Bayesian methods. In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations.
An advantage of the naive Bayes classifier is that it only requires a small amount of training data to estimate the parameters (the means and variances of the variables) necessary for classification. Because the variables are assumed to be independent, only the variances of the variables for each class need to be determined, and not the entire covariance matrix.
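As a minimal usage sketch, scikit-learn’s GaussianNB estimates exactly these per-class means and variances; the fruit measurements below are made-up data for illustration.

    from sklearn.naive_bayes import GaussianNB

    # Features: [redness (0-1), roundness (0-1), diameter in inches].
    X = [[0.9, 0.90, 4.0], [0.8, 0.95, 3.8], [0.2, 0.90, 3.0],
         [0.1, 0.30, 7.0], [0.9, 0.20, 6.5]]
    y = ["apple", "apple", "orange", "melon", "melon"]

    model = GaussianNB().fit(X, y)            # estimates per-class means/variances
    print(model.predict([[0.85, 0.9, 3.9]]))  # likely "apple"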
The naive Bayes probabilistic model

Abstractly, the probability model for a classifier is a conditional model

p(C \mid F_1, \dots, F_n)

over a dependent class variable C with a small number of outcomes or classes, conditional on several feature variables F_1 through F_n. The problem is that if the number of features n is large, or when a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable. Using Bayes' theorem, we write

p(C \mid F_1, \dots, F_n) = \frac{p(C)\, p(F_1, \dots, F_n \mid C)}{p(F_1, \dots, F_n)}

In plain English the above equation can be written as

\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}

In practice we are only interested in the numerator of that fraction, since the denominator does not depend on C and the values of the features F_i are given, so that the denominator is effectively constant. The numerator is equivalent to the joint probability model

p(C, F_1, \dots, F_n)
which can be rewritten as follows, using repeated applications of the definition of conditional probability:

p(C, F_1, \dots, F_n) = p(C)\, p(F_1 \mid C)\, p(F_2 \mid C, F_1) \cdots p(F_n \mid C, F_1, \dots, F_{n-1})

Now the "naive" conditional independence assumptions come into play: assume that each feature F_i is conditionally independent of every other feature F_j for j not equal to i. This means that

p(F_i \mid C, F_j) = p(F_i \mid C)

for i \neq j, and so the joint model can be expressed as

p(C, F_1, \dots, F_n) = p(C) \prod_{i=1}^{n} p(F_i \mid C)
This means that under the above independence assumptions, the conditional distribution over the class variable C can be expressed like this:

p(C \mid F_1, \dots, F_n) = \frac{1}{Z}\, p(C) \prod_{i=1}^{n} p(F_i \mid C)

where Z (the evidence) is a scaling factor dependent only on F_1, …, F_n, i.e., a constant if the values of the feature variables are known.

Models of this form are much more manageable, since they factor into a so-called class prior p(C) and independent probability distributions p(F_i \mid C). If there are k classes and if a model for each p(F_i \mid C) can be expressed in terms of r parameters, then the corresponding naive Bayes model has (k − 1) + n r k parameters. In practice, k = 2 (binary classification) and r = 1 (Bernoulli variables as features) are common, and so the total number of parameters of the naive Bayes model is 2n + 1, where n is the number of binary features used for classification and prediction.
Parameter estimation

All model parameters (i.e., class priors and feature probability distributions) can be approximated with relative frequencies from the training set. These are maximum likelihood estimates of the probabilities. A class's prior may be calculated by assuming equi-probable classes (i.e., prior = 1 / (number of classes)), or by calculating an estimate for the class probability from the training set (i.e., (prior for a given class) = (number of samples in the class) / (total number of samples)). To estimate the parameters for a feature's distribution, one must assume a distribution or generate nonparametric models for the features from the training set. If one is dealing with continuous data, a typical assumption is that the continuous values associated with each class are distributed according to a Gaussian distribution.
For example, suppose the training data contains a continuous attribute, x. We first segment the data by the class, and then compute the mean and variance of x in each class. Let \mu_c be the mean of the values in x associated with class c, and let \sigma_c^2 be the variance of the values in x associated with class c. Then, the probability density of some value v given a class c, p(x = v \mid c), can be computed by plugging v into the equation for a Normal distribution parameterized by \mu_c and \sigma_c^2. That is,

p(x = v \mid c) = \frac{1}{\sqrt{2\pi\sigma_c^2}} \exp\!\left(-\frac{(v - \mu_c)^2}{2\sigma_c^2}\right)
Another common technique for handling continuous values is to use binning to discretize the values. In general, the distribution method is a better choice if there is a small amount of training data, or if the precise distribution of the data is known. The discretization method tends to do better if there is a large amount of training data, because it will learn to fit the distribution of the data. Since naive Bayes is typically used when a large amount of data is available (as more computationally expensive models can generally achieve better accuracy), the discretization method is generally preferred over the distribution method.
Sample correction

If a given class and feature value never occur together in the training set, then the frequency-based probability estimate will be zero. This is problematic, since it will wipe out all information in the other probabilities when they are multiplied. It is therefore often desirable to incorporate a small-sample correction (such as Laplace, or add-one, smoothing) in all probability estimates, so that no probability is ever set to exactly zero; a sketch follows.
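A minimal sketch of such a correction (add-one, or Laplace, smoothing) is given below; the function name and the toy feature values are ours, for illustration only:

```python
# Add-alpha smoothing for a categorical feature: no estimate is ever exactly zero.
from collections import Counter

def smoothed_likelihood(value, observed_values, alpha=1.0):
    """Estimate P(feature = value | class) from the feature values observed
    for one class, with an add-alpha correction (alpha = 1 is Laplace)."""
    counts = Counter(observed_values)
    domain_size = len(set(observed_values) | {value})
    return (counts[value] + alpha) / (len(observed_values) + alpha * domain_size)

# "round" never co-occurred with this class, yet its probability stays above zero.
print(smoothed_likelihood("round", ["long", "long", "curved"]))   # ~0.167
```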
Constructing a classifier from the probability model

The discussion so far has derived the independent feature model, that is, the naive Bayes probability model. The naive Bayes classifier combines this model with a decision rule. One common rule is to pick the hypothesis that is most probable; this is known as the maximum a posteriori or MAP decision rule. The corresponding classifier is the function classify defined as follows:

\mathrm{classify}(f_1, \dots, f_n) = \operatorname*{argmax}_{c}\; p(C = c) \prod_{i=1}^{n} p(F_i = f_i \mid C = c)
Despite the fact that the far-reaching independence assumptions are often inaccurate, the naive Bayes classifier has several properties that make it surprisingly useful in practice. In particular, the decoupling of the class conditional feature distributions means that each distribution can be independently estimated as a one-dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality, such as the need for data sets that scale exponentially with the number of features. Like all probabilistic classifiers under the MAP decision rule, it arrives at the correct classification as long as the correct class is more probable than any other class; hence class probabilities do not have to be estimated very well. In other words, the overall classifier is robust enough to ignore serious deficiencies in its underlying naive probability model.
Examples

Sex classification. Problem: classify whether a given person is male or female based on the measured features. The features include height, weight, and foot size. The training set is shown below.

sex      height (feet)   weight (lbs)   foot size (inches)
male     6.00            180            12
male     5.92            190            11
male     5.58            170            12
male     5.92            165            10
female   5.00            100            6
female   5.50            150            8
female   5.42            130            7
female   5.75            150            9

The classifier created from the training set using a Gaussian distribution assumption would be:

sex      mean (height)   variance (height)   mean (weight)   variance (weight)   mean (foot size)   variance (foot size)
male     5.855           3.5033e-02          176.25          1.2292e+02          11.25              9.1667e-01
female   5.4175          9.7225e-02          132.5           5.5833e+02          7.5                1.6667e+00

Let's say we have equi-probable classes, so P(male) = P(female) = 0.5. There was no identified reason for making this assumption, so it may have been a bad idea. If we determine P(C) based on frequency in the training set, we happen to get the same answer.
Testing

Below is a sample to be classified as male or female.

sex      height (feet)   weight (lbs)   foot size (inches)
sample   6               130            8

We wish to determine which posterior is greater, male or female.

posterior(male) = P(male) P(height|male) P(weight|male) P(foot size|male) / evidence

posterior(female) = P(female) P(height|female) P(weight|female) P(foot size|female) / evidence

The evidence (also termed the normalizing constant) may be calculated since the sum of the posteriors equals one:

evidence = P(male) P(height|male) P(weight|male) P(foot size|male) + P(female) P(height|female) P(weight|female) P(foot size|female)

The evidence may be ignored since it is a positive constant. (Normal distributions are always positive.) We now determine the sex of the sample.

P(male) = 0.5
P(height|male) = 1.5789 (a probability density greater than 1 is OK; it is the area under the bell curve that equals 1)
P(weight|male) = 5.9881e-06
P(foot size|male) = 1.3112e-03
posterior numerator (male) = their product = 6.1984e-09

P(female) = 0.5
P(height|female) = 2.2346e-01
P(weight|female) = 1.6789e-02
P(foot size|female) = 2.8669e-01
posterior numerator (female) = their product = 5.3778e-04

Since posterior numerator (female) > posterior numerator (male), the sample is female.
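The same arithmetic can be reproduced in a few lines of Python; this is a sketch of the hand calculation above, using the means and variances estimated from the training set:

```python
# Gaussian naive Bayes numerators for the sample (height 6, weight 130, foot size 8).
import math

def gaussian_density(v, mean, var):
    # Normal density; values above 1 are fine, only the area integrates to 1.
    return math.exp(-(v - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

params = {
    "male":   {"height": (5.855, 3.5033e-02), "weight": (176.25, 1.2292e+02),
               "foot size": (11.25, 9.1667e-01)},
    "female": {"height": (5.4175, 9.7225e-02), "weight": (132.5, 5.5833e+02),
               "foot size": (7.5, 1.6667e+00)},
}
sample = {"height": 6.0, "weight": 130.0, "foot size": 8.0}

for sex, features in params.items():
    numerator = 0.5                       # equi-probable prior, P(class) = 0.5
    for name, (mean, var) in features.items():
        numerator *= gaussian_density(sample[name], mean, var)
    print(sex, numerator)                 # male ~6.20e-09, female ~5.38e-04
```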
6.3.3 Decision Tree
----------------------
Decision Tree learning algorithm generates decision trees from the
training data to solve classification and regression problem. Consider you would ----------------------
like to go out for game of Tennis outside. Now the question is how one would
decide whether it is ideal to go out for a game of tennis. Now this depends ----------------------
upon various factors like time, weather, temperature etc. We call these factors ----------------------
as features which will influence our decision. If you could record all the factors
and decision you took, you could get a table something like this. ----------------------
----------------------
Day   Outlook    Temperature   Humidity   Wind     Play Tennis
D1    Sunny      Hot           High       Weak     No
D2    Sunny      Hot           High       Strong   No
D3    Overcast   Hot           High       Weak     Yes
D4    Rain       Mild          High       Weak     Yes
D5    Rain       Cool          Normal     Weak     Yes
D6    Rain       Cool          Normal     Strong   No
D7    Overcast   Cool          Normal     Strong   Yes
D8    Sunny      Mild          High       Weak     No
D9    Sunny      Cool          Normal     Weak     Yes
D10   Rain       Mild          Normal     Weak     Yes
D11   Sunny      Mild          Normal     Strong   Yes
D12   Overcast   Mild          High       Strong   Yes
D13   Overcast   Hot           Normal     Weak     Yes
D14   Rain       Mild          High       Strong   No
With this table, other people would be able to use your intuition to decide whether they should play tennis by looking up what you did given a certain weather pattern, but after just 14 days, it's a little unwieldy to match your current weather situation with one of the rows in the table. We could instead represent this tabular data in the form of a tree.
[Figure: decision tree built from the play-tennis data, with a question at each node, branches for each possible answer, and leaf nodes labelled Yes or No]
Here all the information is represented in the form of a tree. The rectangular boxes represent the nodes of the tree. Splitting of the data is done by asking a question at a node. The branches represent the possible known outcomes obtained by asking the question at the node. The end nodes are the leaves; they represent the various classes into which the data can be classified. The two classes in this example are Yes and No. Thus, to obtain the class/final output, ask the question at the root node and, using the answer, travel down a branch until you reach a leaf node.
Algorithm

1. Start with a training data set, which we'll call S. It should have attributes and a classification.
2. Determine the best attribute in the dataset. (We will go over the definition of 'best attribute' shortly.)
3. Split S into subsets that contain the possible values of the best attribute.
4. Make a decision tree node that contains the best attribute.
5. Recursively generate new decision trees using the subsets of data created in step 3, until a stage is reached where you cannot classify the data further. Represent the class as a leaf node.
Deciding the "best attribute"

Now the most important part of the Decision Tree algorithm is deciding the best attribute. But what does 'best' actually mean? In the Decision Tree algorithm, the best attribute is the one with the most information gain.
[Figure: two candidate splits of the same data; the left split leaves almost equal numbers of '+' and '-' examples in each branch, while the right split largely separates '+' from '-']
The left split has less information gain, as the data is split into two groups that each contain almost equal numbers of '+' and '-' examples, while the split on the right puts more '+' examples in one group and more '-' examples in the other. In order to find the best attribute, we will use entropy.
Entropy

In the machine learning sense, and especially in this case, entropy is a measure of the homogeneity of the data. Its value ranges from 0 to 1: it is close to 0 if all the examples belong to the same class, and close to 1 if there is an almost equal split of the data into different classes. The formula to calculate entropy is:

\mathrm{Entropy}(S) = \sum_{i=1}^{c} -p_i \log_2 p_i

Here p_i represents the proportion of the data with the i-th classification and c represents the number of different classifications.
Now, information gain measures the reduction in entropy obtained by splitting the data on a particular attribute. The formula for the gain from splitting dataset S on attribute A is:

\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)

Here Entropy(S) represents the entropy of the dataset, and the second term on the right is the weighted entropy of the different subsets obtained after the split. The goal is to maximize this information gain. The attribute with the maximum information gain is selected as the parent node, and the data is successively split on that node.
The entropy of the dataset (9 Yes and 5 No examples) is:

\mathrm{Entropy}(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.940

Now, calculating the information gain for each attribute:

Gain(S, Outlook) ≈ 0.246
Gain(S, Humidity) ≈ 0.151
Gain(S, Wind) ≈ 0.048
Gain(S, Temperature) ≈ 0.029
As we can see, the Outlook attribute has the maximum information gain and hence is placed at the top of the tree. A sketch of this calculation follows.
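The entropy and gain arithmetic is easy to check in code. The sketch below uses our own helper functions, written for the 9-Yes/5-No labels above; the Wind-style split (8 Weak examples with 6 Yes, 6 Strong examples split evenly) is taken from the table:

```python
# Entropy and information gain for a list of class labels.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    # Entropy(S) minus the weighted entropy of the subsets made by the split.
    subsets = {}
    for value, label in zip(attribute_values, labels):
        subsets.setdefault(value, []).append(label)
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets.values())

attribute = ["Weak"] * 8 + ["Strong"] * 6
labels = ["Yes"] * 6 + ["No"] * 2 + ["Yes"] * 3 + ["No"] * 3   # 9 Yes, 5 No
print(round(entropy(labels), 3))                     # 0.940, as above
print(round(information_gain(attribute, labels), 3)) # ~0.048, the Wind gain
```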
Problem with continuous data

One question that may arise is how the data is split in the case of continuous data. Suppose there is an attribute, temperature, which takes values from 10 to 45 degrees Celsius. We cannot make a split on every individual value. What we can do in this case is group the continuous values into discrete classes, such as 10–20 for class 1, 20–30 for class 2, and so on, so that a particular value is put into a particular class.
Avoiding overfitting the data

The main problem with decision trees is that they are prone to overfitting: we could keep growing a tree until it classifies the training data perfectly, or until we are not left with any attribute to split on. This would work well on the training dataset but give bad results on the testing dataset. There are two popular approaches to avoiding this in decision trees: stop growing the tree before it becomes too large, or prune the tree after it becomes too large. Typically, a limit to a decision tree's growth is specified in terms of the maximum number of layers, or depth, it is allowed to have. The data available to train the decision tree is split into a training set and a test set, and trees with various maximum depths are created based on the training set and tested against the test set. Cross-validation can be used as part of this approach as well.
Tree pruning is a technique that leverages this splitting redundancy to remove, i.e. prune, the unnecessary splits in our tree. At a high level, pruning relaxes part of the tree from strict and rigid decision boundaries into ones that are smoother and generalise better, effectively reducing the tree complexity. The complexity of a decision tree is defined as the number of splits in the tree. Pruning involves testing the original tree against pruned versions of it: leaf nodes are taken away from the tree as long as the pruned tree performs better against test data than the larger tree. A sketch of depth-limited training follows.
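As a minimal sketch of the first approach (assuming Python with scikit-learn; the dataset and depths are illustrative choices of ours), limit the tree's depth and compare training and test accuracy:

```python
# Depth-limited decision trees: deeper trees fit training data better but can overfit.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (2, 4, 8, None):             # None lets the tree grow until pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(depth, round(tree.score(X_train, y_train), 3),
          round(tree.score(X_test, y_test), 3))
```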
Here are a few of the pros and cons of decision trees that can help you decide whether or not they are the right model for your problem, as well as some tips on how to apply them effectively:
Pros

Easy to understand and interpret. At each node, we are able to see exactly what decision our model is making. In practice, we'll be able to fully understand where our accuracies and errors are coming from, what type of data the model would do well with, and how the output is influenced by the values of the features.

Require very little data preparation. Many ML models may require heavy data pre-processing, such as normalization, and may require complex regularisation schemes. Decision trees, on the other hand, work quite well out of the box after tweaking a few of the parameters.

The cost of using the tree for inference is logarithmic in the number of data points used to train the tree. That's a huge plus, since it means that having more data won't necessarily make a huge dent in our inference speed.
Cons

Overfitting is quite common with decision trees, simply due to the nature of their training. It's often recommended to perform some type of dimensionality reduction, such as PCA, so that the tree doesn't have to learn splits on so many features.
For similar reasons, decision trees are also vulnerable to becoming biased towards the classes that have a majority in the dataset. It's always a good idea to do some kind of class balancing, such as class weights, sampling, or a specialised loss function.
6.3.4 Support Vector Machine

A Support Vector Machine (SVM), as stated by Luis et al. (Luis Gonz, 2005), performs classification by constructing an N-dimensional hyperplane that optimally separates the data into two categories. SVM models are closely related to neural networks: in fact, an SVM model using a sigmoid kernel function is equivalent to a two-layer perceptron neural network, making SVM models a close cousin of classical multilayer perceptron neural networks. Using a kernel function, SVMs are an alternative training method for polynomial, radial basis function and multi-layer perceptron classifiers, in which the weights of the network are found by solving a quadratic programming problem with linear constraints, rather than by solving a non-convex, unconstrained minimization problem as in standard neural network training. In the parlance of the SVM literature, a predictor variable is called an attribute, and a transformed attribute that is used to define the hyperplane is called a feature. The task of choosing the most suitable representation is known as feature selection. A set of features that describes one case (i.e., a row of predictor values) is called a vector. So the goal of SVM modelling is to find the optimal hyperplane that separates clusters of vectors in such a way that cases with one category of the target variable are on one side of the plane and cases with the other category are on the other side of the plane. The vectors near the hyperplane are the support vectors. The figure below presents an overview of the SVM process.
[Figure: overview of the SVM process]
A Two-Dimensional Example

Before considering N-dimensional hyperplanes, let's look at a simple two-dimensional example. Assume we wish to perform a classification, and our data has a categorical target variable with two categories. Also assume that there are two predictor variables with continuous values. If we plot the data points using the value of one predictor on the X axis and the other on the Y axis, we might end up with an image such as the one shown below. One category of the target variable is represented by rectangles while the other category is represented by ovals.
[Figure: two classes (rectangles and ovals) plotted against two predictor variables, with two candidate separating lines]
In this idealized example, the cases with one category are in the lower left corner and the cases with the other category are in the upper right corner; the cases are completely separated. The SVM analysis attempts to find a 1-dimensional hyperplane (i.e. a line) that separates the cases based on their target categories. There are an infinite number of possible lines; two candidate lines are shown above. The question is which line is better, and how do we define the optimal line. The dashed lines drawn parallel to the separating line mark the distance between the dividing line and the closest vectors to the line. The distance between the dashed lines is called the margin. The vectors (points) that constrain the width of the margin are the support vectors. The following figure illustrates this.
[Figure: the margin between the dashed lines, with the support vectors constraining its width, for each candidate separating line]
An SVM analysis (Luis Gonz, 2005) finds the line (or, in general, hyperplane) that is oriented so that the margin between the support vectors is maximized. In the figure above, the line in the right panel is superior to the line in the left panel. If all analyses consisted of two-category target variables with two predictor variables, and the clusters of points could be divided by a straight line, life would be easy. Unfortunately, this is not generally the case, so SVM must deal with (a) more than two predictor variables, (b) separating the points with non-linear curves, (c) handling cases where clusters cannot be completely separated, and (d) handling classifications with more than two categories.
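A minimal two-dimensional sketch of this (assuming Python with scikit-learn and NumPy; the synthetic clusters are our own stand-in for the rectangles and ovals above) fits a linear SVM and reads off its support vectors:

```python
# A linear SVM on two well-separated 2-D clusters; the fitted model exposes
# the support vectors that constrain the margin.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
lower_left = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(20, 2))
upper_right = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2))
X = np.vstack([lower_left, upper_right])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("number of support vectors per class:", clf.n_support_)
print("support vectors:\n", clf.support_vectors_)
```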
Check your Progress 1

Fill in the blanks.
1. In the case of _________, the classification is used to determine the error of the network and then adjust the network to minimize it.
2. The _________ is also a non-parametric and instance-based learning algorithm.
3. _________ is the measure of homogeneity in the data.
4. _________ is a technique that leverages splitting redundancy to remove the unnecessary splits in our tree.
Activity 1

Find and list real examples where supervised machine learning algorithms are used. Also state the purpose behind using supervised machine learning algorithms.
Summary

●● In this unit, we have discussed supervised and unsupervised learning. We have also described supervised learning algorithms such as K-Nearest Neighbours (KNN), the Naïve Bayes classifier, Decision Trees and Support Vector Machines, with examples.
Keywords

●● Supervised Learning: Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs.
●● Unsupervised Learning: Unsupervised learning is a machine learning technique where you do not need to supervise the model. Instead, you allow the model to work on its own to discover information.
●● Artificial Neural Networks (ANN): Computing systems that are inspired by, but not identical to, the biological neural networks that constitute animal brains. Such systems "learn" to perform tasks by considering examples, generally without being programmed with task-specific rules.
●● Classification: Classification is a process related to categorization, the process in which ideas and objects are recognized, differentiated and understood.
●● Regression: A statistical measurement used in finance, investing, and other disciplines that attempts to determine the strength of the relationship between one dependent variable (usually denoted by Y) and a series of other changing variables (known as independent variables).
Self-Assessment Questions

1. State the difference between supervised and unsupervised machine learning algorithms.
2. Explain the KNN algorithm with an example.
3. What are the advantages and disadvantages of decision trees?
Answers to Check your Progress

Check your Progress 1

Fill in the blanks.
1. In the case of neural networks, the classification is used to determine the error of the network and then adjust the network to minimize it.
2. The KNN classifier is also a non-parametric and instance-based learning algorithm.
3. Entropy is the measure of homogeneity in the data.
4. Tree pruning is a technique that leverages splitting redundancy to remove the unnecessary splits in our tree.
Suggested Reading

1. Introduction to Machine Learning by Ethem Alpaydin
2. Machine Learning For Beginners: Machine Learning Basics for Absolute Beginners by Scott Chesterton
3. Understanding Machine Learning: From Theory to Algorithms by Shai Shalev-Shwartz and Shai Ben-David
4. Machine Learning For Dummies by John Paul Mueller and Luca Massaron
The text is adapted by Symbiosis Centre for Distance Learning under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license, as requested by the work's creator or licensees. This license is available at https://creativecommons.org/licenses/by-sa/4.0/.
UNIT 7

Unsupervised Learning
Structure:
7.1 Introduction
7.2 Concept of Unsupervised learning
7.3 Unsupervised Learning Algorithms
Summary
Key Words
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
Objectives

After going through this unit, you will be able to:
• Understand the concept of unsupervised machine learning algorithms
• Explain the different unsupervised machine learning algorithms
7.1 INTRODUCTION

In the last unit, we discussed supervised learning algorithms in brief. In this unit, we are going to discuss unsupervised learning algorithms in detail, with examples.
7.2 CONCEPT OF UNSUPERVISED LEARNING

With unsupervised learning, the goal is to have the computer learn how to do something that we don't tell it how to do! There are actually two approaches to unsupervised learning. The first approach is to teach the agent not by giving explicit categorizations, but by using some sort of reward system to indicate success. Note that this type of training will generally fit into the decision problem framework, because the goal is not to produce a classification but to make decisions that maximize rewards. This approach nicely generalizes to the real world, where agents might be rewarded for doing certain actions and punished for doing others. Often, a form of reinforcement learning can be used for unsupervised learning, where the agent bases its actions on the previous rewards and punishments without necessarily even learning any information about the exact ways that its actions affect the world.

In a way, all of this information is unnecessary, because by learning a reward function the agent simply knows what to do without any processing: it knows the exact reward it expects to achieve for each action it could take. This can be extremely beneficial in cases where calculating every possibility is very time consuming (even if all of the transition probabilities between world states were known). On the other hand, it can be very time consuming to learn by, essentially, trial and error. But this kind of learning can be powerful because it assumes no pre-discovered classification of examples. In some cases, for example, our classifications may not be the best possible.

One striking example is that the conventional wisdom about the game of backgammon was turned on its head when a series of computer programs (neuro-gammon and TD-gammon) that learned through unsupervised learning became stronger than the best human backgammon players merely by playing themselves over and over. These programs discovered some principles that surprised the backgammon experts and performed better than backgammon programs trained on pre-classified examples.
A second type of unsupervised learning is called clustering. In this type of learning, the goal is not to maximize a utility function, but simply to find similarities in the training data. The assumption is often that the clusters discovered will match reasonably well with an intuitive classification. For instance, clustering individuals based on demographics might result in a clustering of the wealthy in one group and the poor in another. Although the algorithm won't have names to assign to these clusters, it can produce them and then use those clusters to assign new examples into one or the other of the clusters. This is a data-driven approach that can work well when there is sufficient data; for instance, social information filtering algorithms, such as those that Amazon.com uses to recommend books, are based on the principle of finding similar groups of people and then assigning new users to groups.

In some cases, such as with social information filtering, the information about other members of a cluster (such as what books they read) can be sufficient for the algorithm to produce meaningful results. In other cases, the clusters may merely be a useful tool for a human analyst. Unfortunately, even unsupervised learning suffers from the problem of overfitting the training data. There is no silver bullet for avoiding the problem, because any algorithm that can learn from its inputs needs to be quite powerful. This lack of robustness is known as overfitting in the statistics and machine learning literature.
----------------------
Unsupervised learning has produced many successes, such as world-
champion calibre backgammon easy way to assign values to actions. Clustering ----------------------
can be useful when there is enough programs and even machines capable of
driving cars! It can be a powerful technique when there is an data to form clusters ----------------------
(though this turns out to be difficult at times) and especially when additional
----------------------
data about members of a cluster can be used to produce further results due to
dependencies in the data. ----------------------
Classification learning is powerful when the classifications are known to be correct (for instance, when dealing with diseases, it is generally straightforward to determine the diagnosis after the fact by an autopsy), or when the classifications are simply arbitrary things that we would like the computer to be able to recognize for us. Classification learning is often necessary when the decisions made by the algorithm will be required as input somewhere else; otherwise, it wouldn't be easy for whoever requires that input to figure out what it means. Both techniques can be valuable, and which one you choose should depend on the circumstances: what kind of problem is being solved, how much time is allotted to solving it (supervised learning or clustering is often faster than reinforcement learning techniques), and whether supervised learning is even possible.
7.3 UNSUPERVISED LEARNING ALGORITHMS

In the above section, we discussed unsupervised learning in detail. Now let's discuss the following three unsupervised algorithms in detail:
1. K-Means Clustering
2. Apriori Algorithm
3. Self-Organizing Map
7.3.1 K-Means Clustering

The basic steps of k-means clustering are uncomplicated. In the beginning, we determine the number of clusters K and we assume the centres of these clusters. We can take any random objects as the initial centres, or the first K objects in sequence can also serve as the initial centres. Then the k-means algorithm repeats the three steps below until convergence, i.e. until it is stable (no object moves group):
1. Determine the centre coordinates.
2. Determine the distance of each object to the centres.
3. Group the objects based on minimum distance.
Figure 7.1 shows a k-means flow diagram.
[Figure: flow diagram of the k-means iteration]
Fig. 7.1: K-means iteration
K-means (Bishop C. M., 1995) and (Tapas Kanungo, 2002) is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed in a cunning way, because different locations cause different results. So the better choice is to place them as far away from each other as possible.

The next step is to take each point belonging to a given data set and associate it to the nearest centroid. When no point is pending, the first step is completed and an early groupage is done. At this point we need to re-calculate k new centroids as barycentres of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new centroid. A loop has been generated. As a result of this loop we may notice that the k centroids change their location step by step until no more changes are made; in other words, the centroids do not move any more. Finally, this algorithm aims at minimizing an objective function, in this case a squared error function. The objective function

J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2

where \left\| x_i^{(j)} - c_j \right\|^2 is a chosen distance measure between a data point x_i^{(j)} and the cluster centre c_j, is an indicator of the distance of the n data points from their respective cluster centres. The algorithm is composed of the following steps:
1. Place K points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
Although it can be proved that the procedure will always terminate, the k-means algorithm does not necessarily find the optimal configuration, corresponding to the global objective function minimum. The algorithm is also significantly sensitive to the initial randomly selected cluster centres; it can be run multiple times to reduce this effect. K-means is a simple algorithm that has been adapted to many problem domains and, as we are going to see, it is a good candidate for extension to work with fuzzy feature vectors. As an example, suppose that we have n sample feature vectors x1, x2, ..., xn, all from the same class, and we know that they fall into k compact clusters, k < n. Let mi be the mean of the vectors in cluster i. If the clusters are well separated, we can use a minimum-distance classifier to separate them. That is, we can say that x is in cluster i if || x - mi || is the minimum of all the k distances. This suggests the following procedure for finding the k means (a runnable sketch follows the list):
●● Make initial guesses for the means m1, m2, ..., mk
●● Until there are no changes in any mean
●● Use the estimated means to classify the samples into clusters
●● For i from 1 to k
●● Replace mi with the mean of all of the samples for cluster i
●● end_for
●● end_until
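A minimal NumPy sketch of this procedure is given below (the function name is ours; it ignores the empty-cluster case, just as the discussion below does):

```python
# k-means: alternate between minimum-distance classification and mean updates.
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]   # initial guesses
    for _ in range(n_iter):
        # Classify each sample by its nearest mean (minimum-distance rule).
        distances = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
        labels = distances.argmin(axis=1)
        new_means = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_means, means):                  # no mean moved
            break
        means = new_means
    return means, labels
```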
Here is an example showing how the means m1 and m2 move into the centres of two clusters.

[Figure: the means m1 and m2 moving into the centres of two clusters over successive iterations]
This is a simple version of the k-means procedure. It can be viewed as a greedy algorithm for partitioning the n samples into k clusters so as to minimize the sum of the squared distances to the cluster centres. It does have some weaknesses:
●● The way to initialize the means was not specified. One popular way to start is to randomly choose k of the samples.
●● The results produced depend on the initial values for the means, and it frequently happens that suboptimal partitions are found. The standard solution is to try a number of different starting points.
●● It can happen that the set of samples closest to mi is empty, so that mi cannot be updated. This is an annoyance that must be handled in an implementation, but one that we shall ignore here.
●● The results depend on the metric used to measure || x - mi ||. A popular solution is to normalize each variable by its standard deviation, though this is not always desirable.
●● The results depend on the value of k.
This last problem is particularly troublesome, since we often have no way of knowing how many clusters exist. In the example shown above, the same algorithm applied to the same data produces the following 3-means clustering. Is it better or worse than the 2-means clustering?

[Figure: a 3-means clustering of the same data]
Unfortunately, there is no general theoretical solution for finding the optimal number of clusters for any given data set. A simple approach is to compare the results of multiple runs with different values of k and choose the best one according to a given criterion, as in the sketch below.
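A minimal sketch of that comparison (reusing the k_means function from the sketch above, with the within-cluster sum of squared distances, the objective J, as the criterion) might look like this:

```python
# Compare several values of k by the objective J and look for an "elbow".
import numpy as np

def within_cluster_ss(X, means, labels):
    return sum(((X[labels == i] - m) ** 2).sum() for i, m in enumerate(means))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2))
               for c in ([0, 0], [3, 0], [0, 3])])   # three true clusters
for k in range(1, 6):
    means, labels = k_means(X, k)                    # k_means: see sketch above
    print(k, round(within_cluster_ss(X, means, labels), 2))
```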
7.3.2 Apriori Algorithm

Apriori is an algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent item sets for boolean association rules. The name of the algorithm comes from the fact that it uses prior knowledge of frequent item set properties. Apriori employs an iterative approach known as level-wise search, where k-item sets are used to explore (k+1)-item sets. There are two steps in each iteration: the first step generates a set of candidate item sets; then, in the second step, the occurrence of each candidate set in the database is counted, and all disqualified candidates (i.e. all infrequent item sets) are pruned. Apriori uses two pruning techniques: first, on the basis of support count (which should be greater than the user-specified support threshold), and second, that for an item set to be frequent, all its subsets must appear in the previous level's frequent item sets. The iterations begin with size-2 item sets, and the size is incremented after each iteration. The algorithm is based on the closure property of frequent item sets: if a set of items is frequent, then all its proper subsets are also frequent. This algorithm is easy to implement and parallelize, but it has the major disadvantage that it requires multiple scans of the database and is memory resident.

The frequent item sets determined by Apriori can be used to determine association rules, which highlight general trends in the database. This has applications in domains such as market basket analysis.
Key Concepts:
●● Frequent Itemsets: The sets of items which have minimum support (denoted by Li for the i-th itemsets).
●● Apriori Property: Any subset of a frequent itemset must be frequent.
●● Join Operation: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.
The pseudocode for the algorithm is given below for a transaction database T and a support threshold of ε. Usual set-theoretic notation is employed, though note that T is a multiset. Ck is the candidate set for level k. At each step, the algorithm is assumed to generate the candidate sets from the large item sets of the preceding level, heeding the downward closure lemma. count[c] accesses a field of the data structure that represents candidate set c, which is initially assumed to be zero. Usually the most important part of the implementation is the data structure used for storing the candidate sets and counting their frequencies.
Apriori(T, ε)
    L1 ← {large 1-itemsets}
    k ← 2
    while Lk-1 ≠ ∅
        Ck ← {a ∪ {b} | a ∈ Lk-1, b ∉ a, and every (k−1)-subset of a ∪ {b} is in Lk-1}
        for transactions t ∈ T
            Ct ← {c ∈ Ck | c ⊆ t}
            for candidates c ∈ Ct
                count[c] ← count[c] + 1
        Lk ← {c ∈ Ck | count[c] ≥ ε}
        k ← k + 1
    return ∪k Lk
Example:

Consider a database, D, consisting of 9 transactions. Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 ≈ 22%) and the minimum confidence required is 70%. We first have to find the frequent itemsets using the Apriori algorithm; then, association rules will be generated using minimum support and minimum confidence.
TID     List of Items
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3
Step 1: Generating 1-itemset Frequent Pattern

Scanning D for the count of each candidate gives C1:

Itemset   Support Count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

Every candidate meets the minimum support count, so L1 = C1.
●● The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support.
●● In the first iteration of the algorithm, each item is a member of the set of candidates.
Step 2: Generating 2-itemset Frequent Pattern

C2 (candidate 2-itemsets, with support counts from scanning D):

Itemset     Support Count
{I1, I2}    4
{I1, I3}    4
{I1, I4}    1
{I1, I5}    2
{I2, I3}    4
{I2, I4}    2
{I2, I5}    2
{I3, I4}    0
{I3, I5}    1
{I4, I5}    0

L2 (candidates with support count ≥ 2): {I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}.
To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 Join L1 to generate a candidate set of 2-itemsets, C2. Next, the transactions in D are scanned and the support count for each candidate itemset in C2 is accumulated (as shown in the table above). The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support. (Note: we haven't used the Apriori Property yet.)
Step 3: Generating 3-itemset Frequent Pattern

C3 (after the join and prune steps described below), with support counts from scanning D:

Itemset         Support Count
{I1, I2, I3}    2
{I1, I2, I5}    2

Both satisfy the minimum support count, so L3 = {{I1, I2, I3}, {I1, I2, I5}}.
The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori Property. In order to find C3, we compute L2 Join L2:

C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.

Now the Join step is complete, and the Prune step will be used to reduce the size of C3. The Prune step helps to avoid heavy computation due to a large Ck. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. How?
For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} and {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we keep {I1, I2, I3} in C3.

Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. Its 2-item subsets are {I2, I3}, {I2, I5} and {I3, I5}. But {I3, I5} is not a member of L2, and hence {I2, I3, I5} is not frequent, violating the Apriori Property. Thus we have to remove {I2, I3, I5} from C3. Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the result of the Join operation for pruning.

Now, the transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
Step 4: Generating 4-itemset Frequent Pattern

The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset {I2, I3, I5} is not frequent. Thus, C4 = φ, and the algorithm terminates, having found all of the frequent itemsets. This completes our Apriori algorithm; a sketch of the full level-wise search follows.
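A compact Python sketch of this level-wise search (our own implementation of the pseudocode, run on the nine transactions of D with a minimum support count of 2) prints L1, L2 and L3 before stopping:

```python
# Apriori on database D: join frequent (k-1)-itemsets, prune by the Apriori
# property, then count support in the transactions.
from itertools import combinations

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
min_sup = 2

def frequent(candidates):
    # Keep candidates whose support count in the database meets min_sup.
    return {c for c in candidates if sum(c <= t for t in transactions) >= min_sup}

L = frequent({frozenset({i}) for t in transactions for i in t})   # L1
k = 2
while L:
    print(sorted(sorted(c) for c in L))
    joined = {a | b for a in L for b in L if len(a | b) == k}     # join step
    pruned = {c for c in joined                                   # prune step
              if all(frozenset(s) in L for s in combinations(c, k - 1))}
    L = frequent(pruned)
    k += 1
```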
What's next? These frequent itemsets will be used to generate strong association rules (where strong association rules satisfy both minimum support and minimum confidence).

Step 5: Generating Association Rules from Frequent Itemsets

Procedure:
●● For each frequent itemset l, generate all nonempty subsets of l.
●● For every nonempty subset s of l, output the rule "s -> (l − s)" if support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold.
Back to the example:

We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}, {I1, I2, I3}, {I1, I2, I5}}.

Let's take l = {I1, I2, I5}. All its nonempty subsets are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}. Let the minimum confidence threshold be, say, 70%. The resulting association rules are shown below, each listed with its confidence.
– R1: I1 ^ I2 -> I5
●● Confidence = sc{I1, I2, I5} / sc{I1, I2} = 2/4 = 50%
●● R1 is Rejected.
– R2: I1 ^ I5 -> I2
●● Confidence = sc{I1, I2, I5} / sc{I1, I5} = 2/2 = 100%
●● R2 is Selected.
– R3: I2 ^ I5 -> I1
●● Confidence = sc{I1, I2, I5} / sc{I2, I5} = 2/2 = 100%
●● R3 is Selected.
– R4: I1 -> I2 ^ I5
●● Confidence = sc{I1, I2, I5} / sc{I1} = 2/6 = 33%
●● R4 is Rejected.
– R5: I2 -> I1 ^ I5
●● Confidence = sc{I1, I2, I5} / sc{I2} = 2/7 = 29%
●● R5 is Rejected.
– R6: I5 -> I1 ^ I2
●● Confidence = sc{I1, I2, I5} / sc{I5} = 2/2 = 100%
●● R6 is Selected.

In this way, we have found three strong association rules.
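The same confidences can be recomputed mechanically; this sketch reuses the transactions list from the Apriori sketch above:

```python
# Confidence of each candidate rule s -> (l - s) for l = {I1, I2, I5}.
def support_count(itemset):
    return sum(itemset <= t for t in transactions)   # transactions: see above

l = frozenset({"I1", "I2", "I5"})
for antecedent in [{"I1", "I2"}, {"I1", "I5"}, {"I2", "I5"},
                   {"I1"}, {"I2"}, {"I5"}]:
    confidence = support_count(l) / support_count(frozenset(antecedent))
    verdict = "Selected" if confidence >= 0.70 else "Rejected"
    print(sorted(antecedent), "->", sorted(l - antecedent),
          f"{confidence:.0%}", verdict)
```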
Methods to Improve Apriori's Efficiency

●● Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
●● Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.
●● Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
●● Sampling: mining on a subset of the given data, with a lower support threshold plus a method to determine completeness.
●● Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.
Association rule learning is one of the most popular applications of machine learning in business. It has been widely used by various organizations, including supermarket chains and online marketplaces, to understand and test business and marketing strategies for increasing sales and productivity. Association rule learning is rule-based learning for identifying associations between different variables in a database.
7.3.3 Self-Organizing Map

A self-organizing map (SOM) or self-organizing feature map (SOFM) is a type of artificial neural network (ANN) that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map, and is therefore a method of dimensionality reduction. Self-organizing maps differ from other artificial neural networks in that they apply competitive learning as opposed to error-correction learning (such as backpropagation with gradient descent), and in that they use a neighbourhood function to preserve the topological properties of the input space.
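A minimal NumPy sketch of this training scheme is given below (all sizes, rates and the random data are illustrative choices of ours, not values from the text); each input pulls its winning unit, and that unit's grid neighbours, towards itself:

```python
# A tiny self-organizing map: competitive learning plus a neighbourhood function.
import numpy as np

rng = np.random.default_rng(0)
grid_h, grid_w, dim = 5, 5, 3                 # 5x5 topological map, 3-D inputs
weights = rng.random((grid_h, grid_w, dim))
coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w),
                              indexing="ij"), axis=-1)   # grid position of each unit

X = rng.random((500, dim))                    # unlabelled training inputs
epochs = 20
for epoch in range(epochs):
    lr = 0.5 * (1 - epoch / epochs)           # decaying learning rate
    radius = 0.5 + 3.0 * (1 - epoch / epochs) # shrinking neighbourhood
    for x in X:
        # Winner: the unit whose weight vector is nearest to the input.
        d = ((weights - x) ** 2).sum(axis=-1)
        winner = np.unravel_index(d.argmin(), d.shape)
        # Units close to the winner on the grid move towards x as well.
        g = np.exp(-((coords - np.array(winner)) ** 2).sum(axis=-1)
                   / (2 * radius ** 2))
        weights += lr * g[..., None] * (x - weights)
```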
Self-Organizing Feature Map networks are used quite differently from the other networks. Whereas all the other networks are designed for supervised learning tasks, SOFM networks are designed primarily for unsupervised learning. (Whereas in supervised learning the training data set contains cases featuring input variables together with the associated outputs, and the network must infer a mapping from the inputs to the outputs, in unsupervised learning the training data set contains only input variables.) At first glance this may seem strange. Without outputs, what can the network learn? The answer is that the SOFM network attempts to learn the structure of the data.
The SOFM network can learn to recognize clusters of data, and can also
---------------------- relate similar classes to each other. The user can build up an understanding of
the data, which is used to refine the network. As classes of data are recognized,
----------------------
they can be labelled, so that the network becomes capable of classification
---------------------- tasks. SOFM networks can also be used for classification when output classes
are immediately available - the advantage in this case is their ability to highlight
---------------------- similarities between classes. A second possible use is in novelty detection.
---------------------- SOFM networks can learn to recognize clusters in the training data, and
respond to it. If new data, unlike previous cases, is encountered, the network
---------------------- fails to recognize it and this indicates novelty. A SOFM network has only two
layers: the input layer, and an output layer of radial units (also known as the
----------------------
topological map layer). The units in the topological map layer are laid out in
---------------------- space - typically in two dimensions (although ST Neural Networks also supports
one dimensional Kohonen networks). SOFM networks are trained using an
---------------------- iterative algorithm. Starting with an initially-random set of radial centres, the
algorithm gradually adjusts them to reflect the clustering of the training data.
----------------------
At one level, this compares with the sub-sampling and K-Means algorithms
---------------------- used to assign centres in SOM network and indeed the SOFM algorithm can
be used to assign centres for these types of networks. However, the algorithm
---------------------- also acts on a different level. The iterative training procedure also arranges the
network so that units representing centres close together in the input space are
----------------------
also situated close together on the topological map.
---------------------- You can think of the network’s topological layer as a crude two-
dimensional grid, which must be folded and distorted into the N-dimensional
----------------------
input space, so as to preserve as far as possible the original structure. Clearly
---------------------- any attempt to represent an N-dimensional space in two dimensions will result
in loss of detail; however, the technique can be worthwhile in allowing the user
---------------------- to visualize data which might otherwise be impossible to understand. The basic
iterative Kohonen algorithm simply runs through a number of epochs, on each
----------------------
epoch executing each training case and applying the following algorithm:
---------------------- ●● Select the winning neuron (the one who’s centre is nearest to the input
---------------------- case);
●● Adjust the winning neuron to be more like the input case (a weighted sum
---------------------- of the old neuron centre and the training case).
The algorithm uses a time-decaying learning rate, which is used to perform the weighted sum and ensures that the alterations become more subtle as the epochs pass. This ensures that the centres settle down to a compromise representation of the cases which cause that neuron to win. The topological ordering property is achieved by adding the concept of a neighbourhood to the algorithm.

The neighbourhood is a set of neurons surrounding the winning neuron. The neighbourhood, like the learning rate, decays over time, so that initially quite a large number of neurons belong to the neighbourhood (perhaps almost the entire topological map); in the latter stages the neighbourhood will be zero (i.e., consist solely of the winning neuron itself).

In the Kohonen algorithm, the adjustment of neurons is actually applied not just to the winning neuron, but to all the members of the current neighbourhood. The effect of this neighbourhood update is that initially quite large areas of the network are “dragged towards” training cases, and dragged quite substantially. The network develops a crude topological ordering, with similar cases activating clumps of neurons in the topological map. As epochs pass, the learning rate and neighbourhood both decrease, so that finer distinctions within areas of the map can be drawn, ultimately resulting in fine-tuning of individual neurons.
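To make the procedure concrete, here is a minimal NumPy sketch of the Kohonen training step (a toy illustration: the map size, decay schedules and parameters are assumptions, not a production implementation):

import numpy as np

rng = np.random.default_rng(0)
map_h, map_w, dim = 10, 10, 3                  # 10 x 10 topological map, 3-D inputs
centres = rng.random((map_h, map_w, dim))      # initially random radial centres

def train_step(x, epoch, n_epochs, lr0=0.5, radius0=5.0):
    """Move the winning neuron and its neighbourhood towards input case x."""
    frac = epoch / n_epochs
    lr = lr0 * (1.0 - frac)                    # time-decaying learning rate
    radius = max(radius0 * (1.0 - frac), 0.5)  # shrinking neighbourhood
    dists = np.linalg.norm(centres - x, axis=2)
    wi, wj = np.unravel_index(np.argmin(dists), dists.shape)  # winning neuron
    for i in range(map_h):
        for j in range(map_w):
            grid_dist = np.hypot(i - wi, j - wj)
            if grid_dist <= radius:            # update the whole neighbourhood
                h = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
                # Weighted sum of the old centre and the training case.
                centres[i, j] += lr * h * (x - centres[i, j])

for epoch in range(100):
    for x in rng.random((50, dim)):            # 50 toy training cases per epoch
        train_step(x, epoch, n_epochs=100)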
Often, training is deliberately conducted in two distinct phases: a relatively short phase with high learning rates and a large neighbourhood, and a long phase with a low learning rate and zero or near-zero neighbourhoods. Once the network has been trained to recognize structure in the data, it can be used as a visualization tool to examine the data. Win frequencies (counts of the number of times each neuron wins when training cases are executed) can be examined to see if distinct clusters have formed on the map. Individual cases are executed and the topological map observed, to see if some meaning can be assigned to the clusters (this usually involves referring back to the original application area, so that the relationship between clustered cases can be established). Once clusters are identified, neurons in the topological map are labelled to indicate their meaning (sometimes individual cases may be labelled, too). Once the topological map has been built up in this way, new cases can be submitted to the network. If the winning neuron has been labelled with a class name, the network can perform classification. If not, the network is regarded as undecided. SOFM networks also make use of an accept threshold when performing classification. Since the activation level of a neuron in a SOFM network is the distance of the neuron from the input case, the accept threshold acts as a maximum recognized distance. If the activation of the winning neuron is greater than this distance, the SOFM network is regarded as undecided.

Thus, by labelling all neurons and setting the accept threshold appropriately, a SOFM network can act as a novelty detector (it reports undecided only if the input case is sufficiently dissimilar to all radial units). SOFM networks as expressed by Kohonen (Kohonen, 1997) are inspired by some known properties of the brain. The cerebral cortex is actually a large flat sheet (about 0.5 square metres; it is folded up into the familiar convoluted shape only for convenience in fitting into the skull!) with known topological properties (for example, the area corresponding to the hand is next to the arm, and a distorted human frame can be topologically mapped out in two dimensions on its surface).
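Continuing the toy sketch above (it reuses the trained centres array from that sketch), classification with an accept threshold might look like the following; the labels dictionary is a hypothetical labelling obtained by inspecting the trained map:

# Hypothetical labelling of map positions, assigned after inspecting the map.
labels = {(0, 0): "cluster_A", (9, 9): "cluster_B"}

def classify(x, labels, accept_threshold=0.4):
    """Return the winner's label, or 'undecided' when the winning distance
    exceeds the accept threshold (this is also how novelty is detected)."""
    dists = np.linalg.norm(centres - x, axis=2)
    wi, wj = np.unravel_index(np.argmin(dists), dists.shape)
    if dists[wi, wj] > accept_threshold:
        return "undecided"                     # unlike all radial units: novelty
    return labels.get((wi, wj), "undecided")   # unlabelled neuron -> undecided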
Advantages and Disadvantages of SOM

Self-organizing maps have the following advantages:

Probably the best thing about SOMs is that they are very easy to understand. It is very simple: if two units are close together and there is grey connecting them, then they are similar; if there is a black ravine between them, then they are different. Unlike Multidimensional Scaling or N-land, people can quickly pick up how to use them in an effective manner.

Another great thing is that they work very well. They classify data well and can then easily be evaluated for their own quality, so you can actually calculate how good a map is and how strong the similarities between objects are.

These are the disadvantages:

One major problem with SOMs is getting the right data. Unfortunately, you need a value for each dimension of each member of the sample in order to generate a map. Sometimes this simply is not possible, and often it is very difficult to acquire all of this data, so this is a limiting feature to the use of SOMs, often referred to as the missing data problem.

Another problem is that every SOM is different and finds different similarities among the sample vectors. SOMs organize sample data so that, in the final product, the samples are usually surrounded by similar samples; however, similar samples are not always near each other. If you have a lot of shades of purple, you will not always get one big cluster with all the purples; sometimes the clusters will be split and there will be two groups of purple. Using colour, we could tell that those two groups are in reality similar and just got split, but with most data, those two clusters will look totally unrelated. So a lot of maps need to be constructed in order to get one final good map.

The final major problem with SOMs is that they are very computationally expensive. This is a major drawback, since as the dimensionality of the data increases, dimension-reduction visualization techniques become more important, but unfortunately the time to compute them also increases. For calculating the black-and-white similarity map, the more neighbours you use to calculate the distance, the better the similarity map you will get, but the number of distances the algorithm needs to compute increases exponentially.
Check your Progress 1

Fill in the blanks.
1. The goal of _______ is not to maximize a utility function, but simply to find similarities in the training data.
2. _______ is rule-based learning for identifying the association between different variables in a database.
3. In the _______ algorithm, the adjustment of neurons is actually applied not just to the winning neuron, but to all the members of the current neighbourhood.

Activity 1

Find and list the applications of unsupervised learning algorithms in different domains.
Summary
●● In this unit, we have discussed unsupervised learning. We have also described unsupervised learning algorithms such as K-Means Clustering, the Apriori algorithm and Self-Organizing Maps, with examples.

Keywords
●● Pruning: A technique in machine learning and search algorithms that reduces the size of decision trees by removing sections of the tree that provide little power to classify instances.
●● Frequent Pattern: Frequent pattern discovery, as part of knowledge discovery in databases and data mining, describes the task of finding the most frequent and relevant patterns in large datasets.
●● Join Operation: This operation pairs two tuples from different relations if and only if a given join condition is satisfied.

Self-Assessment Questions
1. Write a short note on unsupervised learning.
2. Explain the Apriori algorithm with an example.
3. Describe the applications of K-means clustering.
Answers to Check your Progress

Check your Progress 1
Fill in the blanks.
1. The goal of clustering is not to maximize a utility function, but simply to find similarities in the training data.
2. Association Rule Learning is rule-based learning for identifying the association between different variables in a database.
3. In the Kohonen algorithm, the adjustment of neurons is actually applied not just to the winning neuron, but to all the members of the current neighbourhood.
Suggested Reading
1. Introduction to Machine Learning by Ethem Alpaydin
2. Machine Learning For Beginners: Machine Learning Basics for Absolute Beginners. Learn What ML Is and Why It Matters. Notes on Artificial Intelligence and Deep Learning are also Included, by Scott Chesterton
3. Understanding Machine Learning: From Theory to Algorithms by Shai Shalev-Shwartz and Shai Ben-David
4. Machine Learning For Dummies by John Paul Mueller and Luca Massaron
The text is adapted by Symbiosis Centre for Distance Learning under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license, as requested by the work’s creator or licensees. This license is available at https://creativecommons.org/licenses/by-sa/4.0/.
Deep Learning
UNIT
8
Structure:
8.1 Introduction to Deep Learning
8.2 Working/Process of Deep Learning
8.3 Deep Learning and Artificial Neural Networks
8.4 Deep Learning and Artificial Intelligence
8.5 Deep Learning and Machine Learning
8.6 Applications of Deep Learning
Summary
Key Words
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
Objectives

After going through this unit, you will be able to:
• Compare deep learning with Artificial Intelligence, Machine Learning and Artificial Neural Networks (ANNs).
• Understand the significance of deep learning in the present scenario of data explosion in different complex forms.
8.1 INTRODUCTION TO DEEP LEARNING

Artificial intelligence (AI) is the ability of machines, especially computer systems, to demonstrate intelligence like a human brain. Through AI, machines simulate human intelligence processes like learning, reasoning and self-correction. Machine learning (ML) is an application of AI that gives machines, especially computers, the capability to learn without being explicitly programmed. In ML, computer programs are developed to access data and use it to automatically learn and improve from experience. It is a branch of AI. In 2006, deep learning applications originated in the form of deep belief networks. The first successful application of deep learning came in 2009 in the field of speech recognition, and these results made speech recognition and neural network practitioners and researchers understand the significance of deep learning as a differentiator over previous neural network techniques. This led to the adoption of the special name ‘deep learning’, instead of attributing the success to neural networks.

Deep learning (DL) is a specific kind of machine learning that mimics the network of neurons in a human brain. In deep learning, the learning phase is executed through multiple layers, that is, deep neural networks. It follows the structure and function of artificial neural networks (ANNs), an architecture where the layers are stacked on top of each other. It can be treated as a class of ML algorithms that uses multiple layers to extract features from the raw input data.

From the above description, it can be easily interpreted that deep learning is a subset of machine learning, which is, in turn, a subset of artificial intelligence, as shown in Fig. 8.1.
[Figure: three nested sets, with Deep Learning inside Machine Learning inside Artificial Intelligence]

Fig. 8.1 Deep learning as a subset of machine learning and artificial intelligence
Deep learning took birth from the research and exploration of artificial neural networks. Deep neural networks (DNNs) are feed-forward neural networks consisting of several hidden layers; hence they have a deep architecture in which high-level concepts and features are expressed in terms of low-level ones, and vice versa. To train such a neural network, a large data set and a large amount of computational power are essential.

Deep learning has two important key aspects. Deep learning models consist of multiple layers or stages of non-linear information processing. They apply supervised or unsupervised learning methods to represent the features at more abstract or successively higher layers. Deep learning methods can exploit complex and compositional non-linear functions in learning distributed and hierarchical feature representations, and also make use of both labeled and unlabeled data in an effective manner. This helps to make sense of data which is in the form of images, sound and text. Currently, as a new area of machine learning, deep learning is getting a considerable place in research fields involving artificial intelligence, neural networks, graphical modeling, pattern recognition and signal processing. The enormous growth of data used for training and the recent advances in machine learning, signal and information processing contribute to the significant growth of deep learning.
8.2 WORKING / PROCESS OF DEEP LEARNING

Given a set of inputs, deep learning predicts outputs. It uses both supervised and unsupervised learning for training. Since deep learning imitates the network of neurons in the human brain, its learning phase is executed through an artificial neural network, an architecture where different layers are stacked on top of each other. The inputs go into neurons and are multiplied by weights with the help of a mathematical algorithm, which also updates the weights of all the neurons. The number of layers represents the depth of the neural network. The result of each layer becomes the input for the next layer of the network. The final layer, called the output layer, generates an actual value for a regression task and a probability for each class in a classification task. The artificial neural network is fully trained in such a way that the finally optimized weights result in an output close to the real value. Such a well-trained neural network can identify an object included in a picture more accurately than a traditional neural network. In the case of image processing, lower layers may identify edges, whereas higher layers may identify concepts relevant to a human, such as digits, letters or faces.
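As an illustration of this layer-by-layer computation, the following minimal NumPy sketch (a toy example with assumed layer sizes and randomly initialized weights; it performs no training) runs one forward pass through a network with eight inputs, one hidden layer and a three-class output layer:

import numpy as np

rng = np.random.default_rng(1)

def relu(z):                     # activation function of the hidden layer
    return np.maximum(0.0, z)

def softmax(z):                  # turns output scores into class probabilities
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy network: 8 inputs -> 5 hidden units -> 3 output classes.
W1, b1 = rng.normal(size=(5, 8)), np.zeros(5)   # initial weights are random
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)

x = rng.random(8)                # one input case with eight features
h = relu(W1 @ x + b1)            # hidden layer: weighted sum + activation
y = softmax(W2 @ h + b2)         # output layer: probability of each class
print(y, y.sum())                # the class probabilities sum to 1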
Where the inputs and even the outputs are analog in nature, like pixel data, text data or audio files, deep learning is the apt technique for such problem domains. It is very helpful in identifying objects in photographs and hence has found good application in object recognition systems. It follows multiple stages, as part of training, in the process of recognizing an object, and hence all the data representations are hierarchical and trained. In the case of recognizing an object, the learning technique starts from low-level features, then moves to mid-level and high-level features, and finally to the level of a trainable classifier.
There are three different ways to use deep learning in classifying objects:
(i) Training from scratch,
(ii) Transfer learning, and
(iii) Feature extraction.

In the case of training from scratch, a very large labeled data set is collected and a network architecture is designed to learn the features and model. This approach is suitable for new applications or applications having a large number of output categories. Such networks can take several days to train because of the large amount of data and the rate of learning. Therefore, this approach is not a common one in applying deep learning.

Transfer learning is the most widely used approach in deep learning applications. It involves fine-tuning a pre-trained model. New data consisting of previously unknown classes is fed to the existing network. With the help of some tweaks made to the network, the pre-existing network can be modified and enhanced for a new task. The new task categorizes the specific objects instead of the hundreds or thousands of other objects. This approach needs less data and hence takes less computation time: minutes or hours instead of days.

In the feature extraction approach, all the network layers are trained to learn certain features from the images. Such features can be pulled out of the network at any time during the training process and used as input to a machine learning model.
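A minimal sketch of the transfer learning approach is shown below, assuming TensorFlow/Keras and the MobileNetV2 model pre-trained on ImageNet; the five-class task and the commented-out training data are illustrative assumptions:

import tensorflow as tf

# Reuse a network pre-trained on ImageNet; freeze it and train only a new head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False                               # keep pre-trained weights

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),  # new task: 5 categories
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(new_images, new_labels, epochs=5)        # hypothetical new data

Because only the small new head is trained, this typically needs far less data and time than training from scratch.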
Deep learning methods have been growing vastly and becoming richer by including the methods of neural networks, hierarchical probabilistic models, and a variety of unsupervised and supervised feature learning algorithms. Deep learning uses many layers of non-linear information processing that are hierarchical in nature. Deep networks can be broadly categorized into three classes. One category is deep networks for unsupervised or generative learning; these are used to capture higher-order correlations of the observed or visible data for pattern analysis or synthesis purposes when no information about target class labels is available. The second class is deep networks for supervised learning; these are used to directly provide discriminative power for pattern classification purposes. The third category is hybrid deep networks, wherein the goal is discrimination assisted by the outcomes of generative or unsupervised deep networks; this can be accomplished by better optimization and/or regularization of the deep networks for supervised learning.
8.3 DEEP LEARNING AND ARTIFICIAL NEURAL NETWORKS

Deep learning is represented as large neural networks, which can be trained thanks to fast computers and the availability of huge amounts of data. As more and more data becomes available, the performance of such large neural networks keeps increasing. Supervised learning, or learning from labeled data, supports deep learning.

A sample deep neural network is given in Fig. 8.2. The neurons are grouped into three different kinds of layers – input layer, hidden layers and output layer. Deep learning uses a neural network having more than one hidden layer. The input data is given to the input layer; the hidden layers perform mathematical computation on the input data and the output layer returns the output data. In the figure, there are eight neurons in the input layer, and these neurons receive eight different inputs of data. The input layer transfers the data to the first hidden layer. Neurons are connected to each other like the neurons in the human brain. Each connection between neurons is associated with a weight. The weights dictate the importance of the input values, and the initial weights are set randomly. While predicting the result, the significant factors are identified, and accordingly the neuron connections representing them will be assigned big weights. Each neuron is assigned an activation function. Once a set of input data passes through all the layers of the network, the network returns the result through the output layer.
[Figure: a feed-forward network with an input layer, several hidden layers and an output layer]

Fig. 8.2 Sample deep learning network
8.4 DEEP LEARNING AND ARTIFICIAL INTELLIGENCE

Artificial intelligence (AI) processes include learning, reasoning and self-correction, which are the major processes carried out by human intelligence. AI processes are handled by computer systems or machines, whereas human intelligence processes are addressed by the human brain. Deep learning involves learning from huge data with the help of algorithms that mimic the workings of the human brain. It is a breakthrough in the field of AI.

Deep learning allows training an artificial intelligence, with the help of supervised and unsupervised learning, to predict outputs based on a given set of inputs. The number of layers used represents the depth of the deep learning model.

Deep learning algorithms support many of the most advanced artificial intelligence tools and are only as smart as the data given in training. The interactions with Hey Google, Alexa, Google Search and Google Photos are all based on deep learning. The more we use them, the more accurately they retrieve information for us.
Check your Progress 1

Answer the question.
1. Compare deep learning with machine learning.
8.5 DEEP LEARNING AND MACHINE LEARNING

Deep learning is treated as a part of machine learning, because deep learning algorithms also need huge data to learn how to solve the given tasks and problems. Deep learning uses a multi-layered structure of algorithms in the form of a big artificial neural network (ANN) consisting of more than one hidden layer.

Machine learning incorporates standard algorithms like clustering, regression or classification for various kinds of tasks, and they must be trained on data. ML models try to minimize the error between their predictions and the actual ground truth values.

Deep learning does not need feature extraction, when compared to machine learning, and this is a very important characteristic of deep learning. Since the layers of the neural network are able to learn an implicit representation of the raw data directly on their own, the feature extraction step is already a part of the ANN process and will be further optimized. Because of this, deep learning models do not require any manual effort to perform and optimize the feature extraction process; in a deep learning model, there is no necessity for a separate feature extraction step. In contrast, ML algorithms rely on feature extraction through an abstract representation of the given raw data, and this is usually a very complicated task requiring detailed knowledge of the problem domain.

Machine learning needs less data to train the algorithm than deep learning, which requires an extensive and diverse set of data to identify the underlying structure. Machine learning provides a faster-trained model, whereas the most advanced deep learning models take days to weeks to train. Deep learning gives highly accurate results when compared to machine learning. Table 1 gives a comparison between deep learning and machine learning.
Table 1. Comparison of Deep Learning with Machine Learning

Parameter                 | Machine learning                     | Deep learning
Size of training data set | Small to medium                      | Large
Number of algorithms      | Many                                 | Few
Training/execution time   | Short, from a few minutes to hours   | Long, up to weeks, because of the significant number of weights to compute
Dependency on hardware    | Works on low-end machines            | Works on powerful machines, preferably with a graphical processing unit (GPU)
Feature engineering       | Requires choosing and understanding the best features that represent the data | Does not require choosing the best features that represent the data
Activity 1

Link deep learning with AI, Machine Learning and Neural Networks.
8.6 APPLICATIONS OF DEEP LEARNING

Deep learning is currently finding vast avenues in which to play a significant role. The following are the major application areas of deep learning:
●● Speech recognition
●● Natural Language Processing
●● Information retrieval
●● Object recognition and Computer vision
●● Multimodal and Multi-task learning
Deep learning uses different acoustic models for speech feature learning and for speech recognition. It uses primitive spectral or possibly waveform features, the lowest level of raw features of speech in the form of speech sound waveforms, for speech recognition, and learns the transformation automatically. Deep learning addresses speech synthesis by generating speech sounds directly from text. It has also found application in processing audio and music.

Natural language processing (NLP) and deep learning are advancing together, mutually guiding each other. Here, language models (LMs) play a crucial role in speech recognition, text information retrieval, statistical machine translation and other NLP tasks. Deep learning can develop features or representations from raw text material that are appropriate for a wide range of NLP tasks. Neural network based deep learning methods are found to perform well on various NLP tasks like language modeling, machine translation, part-of-speech tagging, named entity recognition, sentiment analysis and paraphrase detection. Deep learning can perform all these NLP tasks without external hand-designed resources or time-intensive feature engineering. It adopts the concept of ‘embedding’ to represent symbolic information in language text at the levels of word, phrase and even sentence. Information retrieval is a process wherein a user enters a query in order to obtain a set of the most relevant documents from an automated computer system. Here, deep learning can be used to extract semantically meaningful features for subsequent document ranking stages.

In computer vision, deep learning can overcome the difficulty of capturing mid-level information, such as edge intersections, or high-level representations, such as object parts. It can adopt unsupervised feature learning to extract features only, and it can adopt supervised learning to jointly optimize the feature extractor and classifier components of the full system when huge labeled training data is available.

Deep learning can address the difficult multi-task and multi-modal learning problems with the help of shared representations, statistical strengths across tasks involving separate modalities of audio, image, touch and text, and the common semantics associated with all such modalities.

Most of the recent advances in the intelligence embedded in different applications and systems are due to deep learning. Applications like self-driving cars and virtual personal assistants or chatbots like Alexa and Siri are the result of deep learning.
(i) Self-driving cars: Attempts are on to produce driverless cars successfully by integrating several smart features. Here, the main concern is handling unprecedented scenarios, and deep learning uses a regular cycle of testing and implementation to ensure safe driving with more and more exposure to millions of such scenarios. The data from cameras, sensors and geo-mapping helps to create concise and sophisticated models to navigate the car through traffic by identifying paths, signs and pedestrian-only routes and analyzing real-time data on traffic volume and road blockages.

(ii) Aggregation of news and fraud news detection: With the help of deep learning, all the bad, ugly and abusive news can be filtered out from the news feed, and the news can be customized as per the readers. The genuine, fake and fraudulent news that flows through the internet can be identified and filtered out with the help of classifiers developed by deep learning.

(iii) Chatbots or virtual personal assistants: Virtual personal assistants, from Alexa to Siri to Google Assistant, are given an opportunity to learn more and more about the user’s voice and accent through deep learning. Through that learning, they start to provide a secondary human interaction experience. They are trained to know more about their subjects, ranging from the user’s preferences in any aspect, from dining out, movie watching and listening to songs, to the user’s most visited spots. They can also translate the user’s speech to text, make notes for the user and even book appointments.

(iv) Healthcare: The GPU-accelerated applications and systems that are supported by deep learning can help patients, physicians, clinicians and researchers to analyse problems and contribute to improving the lives of humans and animals. They can even mimic an expert physician, diagnose disease with greater efficiency and mitigate the problem of shortage of quality physicians and healthcare providers.
Summary
●● Deep learning automates the process of discovering effective features or representations for any machine learning task, including automatically transferring knowledge from one task to another. It uses an artificial neural network (ANN) to imitate human or animal intelligence. The input layer, the hidden layers and the output layer are the three types of layers present in the deep learning architecture. The neurons are connected with each other, and these connections are associated with a weight representing the importance of the input value. The neurons apply an activation function on the data to standardize the output coming out of the neuron. Iterating through the data set and comparing outputs produces the cost function, which indicates the deviation of the network’s output from the real output. Deep learning is an improved version of machine learning that uses ANNs. Current emerging applications with intelligence are being developed with deep learning. Deep learning enabled many practical applications of machine learning and extended the overall field of artificial intelligence.
Keywords
●● Artificial intelligence, Artificial neural networks, Machine learning, DNNs.
Self-Assessment Questions
1. Explain the features of deep learning.
2. Give a note on applications of deep learning.
3. Describe how deep learning works.
Suggested Reading
1. “Dive into Deep Learning” by Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola.
2. “Deep Learning - Methods and Applications” by Li Deng and Dong Yu.
Artificial Intelligence
UNIT
9
Structure:
9.1 Introduction to Artificial Intelligence
9.2 Characteristics of Artificial Intelligence
9.3 Advantages of Artificial Intelligence
9.4 Components of Artificial Intelligence
9.5 Broad Categories of Artificial Intelligence
9.6 Technologies of Artificial Intelligence
9.7 Artificial Intelligence and Machine Learning
9.8 Applications of Artificial Intelligence
Summary
Key Words
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
Objectives

After going through this unit, you will be able to:
• Understand the meaning, concept and evolution of Artificial Intelligence
• Recall the significant role being played by AI in the currently available and emerging tools and gadgets
• Have knowledge about the applications of AI in different domains
9.1 INTRODUCTION TO ARTIFICIAL INTELLIGENCE

Artificial intelligence (AI) can be treated as a branch of computer science concerned with the automation of intelligent behavior. According to Sternberg (1985), intelligence is the ability to adapt, shape and select environments, and depends on three facets – analytical, creative and practical thinking. In the 1950s, Alan Turing’s paper ‘Computing Machinery and Intelligence’ established the foundational goal and subsequent vision of AI. Norvig and Russell explored different thinking and acting approaches to define the field of AI – thinking humanly and rationally, concerning thought processes and reasoning, and acting humanly and rationally, concerning behavior.

The goal of AI is to develop machines that can mimic human intelligence. First, it aims to enhance understanding of human intelligence by modeling the brain, so as to build machines that can replicate biological functions, especially thinking and deploying knowledge. This method is called connectionism. Next, AI aims to make a mind through the representation of the processes of human thinking in machines. It is concerned with building smart machines capable of performing tasks that typically require human intelligence. Therefore, AI is the simulation of human intelligence by machines, especially computer systems.

AI has two major fundamental features – knowledge representation and search. AI includes the data structures used in knowledge representation, the algorithms for applying that knowledge, and the languages and programming techniques needed for their representation. Most AI programs represent knowledge in some formal language, which is then manipulated by algorithms. Search is a problem-solving technique through which a space of problem states is systematically explored, like the different board configurations in a chess game or the intermediate steps in a reasoning process. The final solution is arrived at by searching the space of alternative solutions.
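As a small illustration of state-space search (a toy example with an assumed state space, not a general AI system), the following sketch uses breadth-first search to find a sequence of states from a start state to a goal state:

from collections import deque

def bfs(start, goal, successors):
    """Return a shortest list of states from start to goal, or None."""
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.popleft()
        state = path[-1]
        if state == goal:
            return path
        for nxt in successors(state):
            if nxt not in visited:             # avoid revisiting states
                visited.add(nxt)
                frontier.append(path + [nxt])
    return None

# Toy state space: states are integers; the legal moves are +1 and *2.
print(bfs(1, 10, lambda s: [s + 1, s * 2]))    # -> [1, 2, 4, 5, 10]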
The costs to be incurred for hardware, software and developers make AI expensive. Therefore, many vendors are including AI components bundled in their standard offerings and providing access to AI-as-a-service (AIaaS) platforms.
9.2 CHARACTERISTICS OF ARTIFICIAL INTELLIGENCE

There are different aspects of intelligence, like learning, reasoning, problem-solving, perception and language understanding. Learning occurs by acquiring information and rules to reach approximate or definite conclusions. AI makes use of heuristics, which are powerful techniques to determine the alternatives to explore in a problem space. A heuristic is a useful and potential problem-solving strategy, and much of intelligence seems to reside in the heuristics used by humans to solve problems. Learning takes different forms – learning by trial and error, and rote learning. The errors that are encountered can be reduced by repeated trials, and this type of learning is called trial-and-error learning. Rote learning is simply memorizing individual items, like solutions to problems and words of vocabulary. Reasoning involves drawing appropriate inferences in the situation at hand. The inferences may be either deductive or inductive. Deductive reasoning moves from generalized principles that are known to be true to a true and specific conclusion. Inductive reasoning moves from specific instances to a generalized conclusion. There are two types of problem-solving methods: special-purpose and general-purpose. A special-purpose problem-solving method is tailor-made for a particular problem and often exploits very specific features of the situation in which the problem is embedded. A general-purpose method can be applied to a wide range of different problems. Perception involves scanning the environment by means of various sense organs, real or artificial. The processes of the perceiver analyze the scene into objects and their features and relationships. A language is a system of signs which have meaning by convention. A mini-language is a matter of convention concerning particular signs having their own meanings. A productive language enables an unlimited number of different sentences to be formulated within it. Human language is a productive language, rich enough to support different sentences.

AI programming can answer the generic questions it is meant to solve. It can absorb new modifications by maintaining highly independent pieces of information together. Without affecting the structure of the program, a minute part of the program’s information can be modified. The program can be modified easily and quickly.
The important characteristics of AI include:
●● Use of computers to perform reasoning, pattern recognition, learning or some other form of inference;
●● Focus on problems that do not respond to algorithmic solutions;
●● Use of representational formalisms to support the programmer in solving problems using inexact, missing or poorly defined information;
●● Reasoning about significant qualitative features of a situation;
●● Dealing with issues of semantic meaning as well as syntactic form;
●● Reliance on heuristic problem-solving methods in situations where optimal or exact results are either too expensive or not possible;
●● Use of large amounts of domain-specific knowledge in solving problems;
●● Use of meta-level knowledge for more sophisticated control of problem-solving strategies;
●● Codifying and permanently storing the information and knowledge used in a decision problem, and noting the problem-solving processes and strategies used in its resolution for later recall in subsequent similar problems;
●● Allowing the development of knowledge bases that capture expressible, explicit and accessible knowledge, eliminating the need for data duplication.
9.3 ADVANTAGES OF ARTIFICIAL INTELLIGENCE

AI has several advantages, and some important ones are given below:
(i) Automation of repetitive learning and discovery through data: AI is different from hardware-driven robotic automation. It performs frequent, high-volume, computerized tasks reliably and without fatigue. It can speed up the process of verifying documents and can handle any monotonous or boring tasks of human beings.
(ii) Adaptation through progressive learning algorithms: Structure and regularities in the data are captured through AI, so that the algorithm acquires a skill and becomes a classifier or a predictor. With the technique of backpropagation, the model is allowed to adjust, through training and added data, when the first answer is not quite right.
(iii) Addition of intelligence: Existing products can be improved with AI capabilities. In this way, automation, conversational platforms, bots and smart machines are being combined with large amounts of data to enhance the capabilities of many technologies and technical systems meant for home and industry purposes, spreading from security intelligence to investment analysis.
(iv) Analysis of more and deeper data: AI in the form of artificial neural networks (ANNs) can have many hidden layers, can possess incredible computing power and can handle big data. The more data that is fed to such deep networks, the more accurate the analytical results that can be derived.
(v) Getting the most out of data: When AI algorithms become self-learning, the data itself can become intellectual property. Having the best data contributes to competitive advantage for any industry. AI can make machines take decisions faster than a human and carry out actions quickly.
(vi) Incredible accuracy: Deep neural networks allow achieving incredible accuracy by handling large amounts of data and using high computational power. In the healthcare field, AI techniques use deep learning, image classification and object recognition to find cancer on MRIs with the same accuracy as highly trained radiologists.
(vii) Reduction in human error: When computers are properly programmed, they avoid making the mistakes that humans make. With AI, decisions can be taken by applying a certain set of algorithms to previously gathered information. Reduction in errors leads to improved accuracy with a greater degree of precision.
9.4 COMPONENTS OF ARTIFICIAL INTELLIGENCE

AI consists of four major components: Applications, Models, Software/Hardware and Programming languages.

[Figure: Artificial Intelligence branching into Applications, Models, Software/Hardware and Programming languages]

The applications of AI include image recognition, speech recognition, chatbots, natural language processing, and sentiment analysis. Different types of AI models are widely used and some are emerging; they include neural networks, machine learning and deep learning. AI needs powerful software and hardware to train and run models, including graphical processing units (GPUs), parallel processing tools like Apache Spark, and cloud data storage and computing platforms. For building models, AI uses different programming languages and frameworks, like C, Java, Python and TensorFlow.
9.5 BROAD CATEGORIES OF ARTIFICIAL INTELLIGENCE

Artificial intelligence generally falls under three broad categories or levels – Narrow AI, General AI and Active AI.

Narrow AI or Weak AI is a kind of AI that operates within a limited context and is a simulation of human intelligence. It is often focused on performing a single, specific task extremely well, often better than a human. While these machines may seem intelligent, they operate under far more constraints and limitations than even the most basic human intelligence. Unlike general or strong AI, artificial narrow intelligence (ANI) systems can attend to a specific task in real time, but they are not conscious, feeling, or driven by emotion the way that humans are. They operate within a pre-determined and pre-defined range. Every kind of machine intelligence that is available today is ANI. Examples include Google Assistant, Siri, Google Translate and other natural language processing tools. These machines are nowhere close to having human-like intelligence: they cannot think for themselves, and they lack the self-awareness, consciousness and genuine intelligence needed to match human intelligence. But these systems are able to process data and complete tasks at a significantly faster pace than any human being can. They act as the building blocks of more intelligent AI in the near future.

General or Strong AI is a kind of AI that operates in a machine with general intelligence, much like a human being, and applies that intelligence to solve any problem. AI reaches the general state when it can perform any intellectual task with the same accuracy level as a human would. This is the kind of AI that can be seen in movies, in which humans interact with machines and operating systems that are conscious, sentient, and driven by emotion and self-awareness. Examples include manufacturing and drone robots. Artificial general intelligence (AGI) systems are expected to be able to reason, solve problems, make judgements under uncertainty, plan, learn, integrate prior knowledge in decision-making, and be innovative, imaginative and creative. AGI is still an emerging field.

Active or Super AI is a kind of AI that can beat humans in many tasks; such AI is more capable than a human. It can surpass human intelligence in all aspects, from creativity to general wisdom to problem-solving. Artificial super intelligence (ASI) is a way into the future, performing extraordinarily well at things such as the arts, decision-making, and emotional relationships.
---------------------- 9.6 TECHNOLOGIES OF ARTIFICIAL INTELLIGENCE
---------------------- In decision-making, AI plays a key role by taking into account different
alternative solutions and selecting the best out of them. There are several
---------------------- decision-oriented technologies of AI. Some important technologies are given
below:
----------------------
Automation: This technology makes an AI system or process function automatically. An example is robotic process automation (RPA), which can be programmed to perform high-volume, repeatable tasks that humans normally perform.
Machine learning: This technology enables a computer or a machine to act without being explicitly programmed. There are three types of machine learning algorithms – supervised, unsupervised and reinforcement learning.
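As a rough, hypothetical illustration of the supervised type, the following Python sketch trains a classifier on labelled examples instead of hand-coded rules (it assumes the scikit-learn library; the data and the choice of model are made up):

```python
# A toy supervised-learning example: the model infers a rule from
# labelled examples instead of being explicitly programmed with one.
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: [hours_studied, hours_slept] -> pass (1) / fail (0)
X_train = [[8, 7], [1, 4], [6, 8], [2, 5], [7, 6], [0, 6]]
y_train = [1, 0, 1, 0, 1, 0]

model = DecisionTreeClassifier()
model.fit(X_train, y_train)     # "learn" from past examples
print(model.predict([[5, 7]]))  # predict the label of an unseen case
```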
Expert systems: Expert systems are computer-based systems fed with rule bases that summarize the reasoning and knowledge used by experts to solve problems. There are three types of expert systems – consultant, associate and tutor. Expert systems play two roles in decision making – advisory and replacement. Advisory expert systems support and advise a decision maker, whereas replacement expert systems stand in the place of a decision maker, so that a decision can be taken and implemented without the need for human approval.
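A minimal sketch of the advisory role is given below: a rule base encodes an expert's reasoning, and a simple engine fires the first rule whose condition matches the facts (the medical rules and facts here are entirely hypothetical, for illustration only):

```python
# A toy advisory expert system: rules map observed facts to advice.
RULES = [
    (lambda f: f["fever"] and f["cough"], "Possible flu: advise rest and fluids."),
    (lambda f: f["fever"] and not f["cough"], "Possible infection: advise a check-up."),
    (lambda f: True, "No acute symptoms detected."),  # default rule
]

def advise(facts):
    # Forward-chain through the rule base and fire the first match.
    for condition, conclusion in RULES:
        if condition(facts):
            return conclusion

print(advise({"fever": True, "cough": True}))  # -> Possible flu: ...
```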
Natural language processing technologies: These technologies enable computers to communicate with their users in the users' native language rather than through menus, forms, commands or graphical user interfaces. NLP tasks include text translation, sentiment analysis and speech recognition.
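As a toy illustration of one of these tasks, the following sketch scores sentiment by counting cue words; this is only for illustration, since real systems rely on trained language models:

```python
# A minimal keyword-based sentiment scorer.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}

def sentiment(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("The staff were great but the delivery was terrible"))  # neutral
```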
Neural computing or Artificial Neural Network (ANN) technologies: These technologies emulate the way neurons work in the human brain. ANNs consist of nodes and connections of varying weights, and these weights are associated with the frequency with which particular patterns have been observed in the past.
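A single artificial neuron can be sketched in a few lines: it computes a weighted sum of its inputs plus a bias and squashes the result with an activation function (the weights below are arbitrary illustration values):

```python
# One artificial neuron: weighted sum of inputs, then sigmoid activation.
import math

def neuron(inputs, weights, bias):
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-z))  # output squashed to the range (0, 1)

print(neuron([0.5, 0.8], [0.4, -0.6], bias=0.1))  # ~0.455
```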
Machine vision: This technology allows machines to see – to capture and analyze visual information using a camera, analog-to-digital conversion and digital signal processing. It is used in applications like signature identification and medical image analysis.
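As a toy sketch of one machine-vision primitive, the following snippet thresholds a tiny grayscale "image" (a grid of made-up pixel intensities, 0–255) to separate bright object pixels from a dark background:

```python
# Threshold a grayscale image into a binary object/background mask.
image = [[12, 200, 210],
         [15, 220, 230],
         [10, 18, 20]]

THRESHOLD = 128
binary = [[1 if pixel > THRESHOLD else 0 for pixel in row] for row in image]
print(binary)  # 1 = bright/object pixel, 0 = dark/background
```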
Theorem proving: These technologies prove mathematical theorems and are helpful for the conceptual use of AI. They strive to build computers that make inferences and draw conclusions from existing facts. Most AI systems possess the ability to reason, which allows them to deduce facts even in the face of incomplete and erroneous information.
9.7 ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
Machine learning encompasses AI mechanisms that allow a computer to identify patterns in historical data that are important for modeling a problem, and thereby to learn from past experience and examples. The identified patterns are helpful for detecting irregularities, making predictions, classifying items or behaviors, and providing decision support. Machine learning is the ability of a machine to learn using large data sets, instead of hard-coded rules.

Machine learning is an approach to achieving AI and is the practice of using algorithms to parse data, learn from it, and then make a prediction about something. Deep learning has enabled many practical applications of machine learning by extending the overall field of AI; it takes advantage of the processing power of modern computers to process large data sets. Machine learning can also adopt unsupervised learning, working with data sets that have no specific structure.
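A minimal, hypothetical sketch of this idea of learning from past examples is to fit a trend to historical figures and extrapolate one step ahead (it assumes scikit-learn; the sales numbers are made up):

```python
# Fit a linear trend to past monthly sales and forecast the next month.
from sklearn.linear_model import LinearRegression

months = [[1], [2], [3], [4], [5]]  # past months
sales = [100, 120, 135, 160, 180]   # observed monthly sales

model = LinearRegression().fit(months, sales)
print(model.predict([[6]]))         # forecast for month 6 (~199)
```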
Activity 1

Discuss the advances and trends that have emerged from the concept of Artificial Intelligence.
9.8 APPLICATIONS OF ARTIFICIAL INTELLIGENCE
AI plays a significant role in developing gaming software like smart chess, poker and tic-tac-toe. In these applications, the machine can consider a large number of possible positions and moves based on heuristic knowledge. AI allows interaction with computer systems that can understand natural language spoken by humans. Expert systems can be developed with AI by integrating machine, software and special information to impart reasoning and advising; they provide explanation and advice to users like human experts. Vision systems can be built with AI to understand, interpret and comprehend visual input given to the computer system. Intelligent systems can hear and comprehend language in terms of sentences and their meanings while a human interacts with them. Software has been developed to recognize human handwriting: it can read text written on paper with a pen or on screen with a stylus, recognize the shapes of the letters and convert them into editable text. Intelligent robots are developed to perform human tasks. Using sensors, they can detect physical data like light, heat, temperature, movement, sound, bumps and pressure from the real-world environment. Using their efficient processors, multiple sensors and large memory, they exhibit intelligence, learn from their mistakes and adapt to new environments. The applications of AI in different fields are briefed below:
AI in healthcare: The biggest bets are on improving patient outcomes and reducing costs. Companies are applying machine learning to make better and faster diagnoses than humans. One of the best-known healthcare technologies is IBM Watson: it understands natural language and is capable of responding to questions asked of it. The system mines patient data and other available data sources to form a hypothesis, which it then presents with a confidence scoring schema. Other AI applications include chatbots – computer programs used online to answer questions and assist customers – which help schedule follow-up appointments and guide patients through the billing process. Virtual health assistants can provide basic medical feedback and act as life coaches, reminding patients when to take medicine, do exercise or eat healthier as per a diet plan. AI applications can also provide personalized medicine and even read scans and X-rays.
AI in business: Machine learning algorithms are being integrated into business analytics and customer relationship management (CRM) platforms to uncover information on how to serve customers better. Chatbots are being incorporated into websites to provide immediate service to customers. Virtual shopping capabilities can offer personalized recommendations and discuss purchase options with customers. Stock management can also be improved.
AI in manufacturing: Manufacturing companies are finding good support from industrial robots, typically assigning them single tasks in the workflow and keeping them separate from human workers. The Internet of Things (IoT) data streaming from connected equipment can be processed to forecast expected load and demand.
AI in finance: Financial applications supported by AI are disrupting financial institutions. They collect personal data, process it, provide proper financial advice and help in making appropriate decisions. AI techniques can be used to identify which transactions are likely to be fraudulent, to adopt fast and accurate credit scoring, and to automate manually intensive data management tasks.
Check your Progress 1

Fill in the blanks.
1. ______ tasks include text translation, sentiment analysis and speech recognition.
2. _________ systems are expected to be able to reason, solve problems, make judgements under uncertainty, plan, learn, integrate prior knowledge in decision-making, and be innovative, imaginative and creative.
Summary

●● Artificial intelligence (AI) can be treated as a branch of computer science concerned with the automation of intelligent behavior. AI is the simulation of human intelligence processes by machines, especially computer systems. Specific applications of AI include expert systems, natural language processing (NLP), speech recognition and machine vision. Machine learning encompasses AI mechanisms that identify patterns in historical data that are important for modeling a problem. AI is “the study and design of intelligent agents”.
Keywords

●● IoT: Internet of Things
●● CRM: Customer Relationship Management
●● ANI: Artificial Narrow Intelligence
●● NLP: Natural Language Processing
Answers to Check your Progress

1. NLP
2. Artificial general intelligence (AGI)
Self-Assessment Questions

1. Explain the features of Artificial Intelligence.
2. Distinguish between Artificial Intelligence and Machine Learning.
3. Describe any five major applications of Artificial Intelligence.
Suggested Reading

1. Artificial Intelligence – A Modern Approach (3rd Edition) – By Stuart Russell & Peter Norvig
2. Artificial Intelligence for Humans – By Jeff Heaton
3. Machine Learning for Beginners – By Chris Sebastian
4. Artificial Intelligence: The Basics – By Kevin Warwick
Business Intelligence
UNIT
10
Structure:
10.1 Introduction to Business Intelligence
10.2 Features of Business Intelligence
10.3 Business Intelligence Process
10.4 Factors Contributing to Successful Business Intelligence
10.5 Business Intelligence and Business Analytics
10.6 Business Intelligence and Big Data
10.7 Advantages and Applications of Business Intelligence
Summary
Key Words
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
Objectives

After going through this unit, you will be able to:

• Understand the meaning, concept and evolution of Business Intelligence
• Recall the significant role being played by BI in currently available and emerging tools and technologies
• Have knowledge about the applications of BI in different domains
10.1 INTRODUCTION TO BUSINESS INTELLIGENCE
Generally, ‘data’ refers to the raw form of facts and observations, whereas ‘information’ refers to processed data related to a subject or context. If such information is processed again, it generates ‘knowledge’, which can be described as information subjected to experience, learning or insight. The further processing and analysis of such knowledge generates ‘wisdom’ or ‘intelligence’ on the subject or context, which is used to develop and execute strategies and tactics and make appropriate decisions. If the tag of ‘business’ is attached to these four elements – data, information, knowledge and intelligence – in order to address the business environment, they transform into the important business-related terms: business data, business information, business knowledge and, finally, business intelligence. Business data is generated from daily transactions and through interactions between customers, employees and the organization. BI helps business organizations gather information, add value to the information through analysis, and report the findings to the concerned managers to solve a wide variety of business-related problems.
Figure 10.1 gives the path of business intelligence starting from business data.

Fig. 10.1. Business intelligence path
In view of the above, business intelligence (BI) can be treated as a data-driven process that combines the storage and collection of data with knowledge management in order to provide inputs to the decision-making process of a business. It is a set of concepts and methods based on fact-based support systems to improve decision making by turning data into meaningful and actionable information directed towards a strategic goal. It requires both technical and organizational elements for analysis, querying and reporting to improve business processes and to make decisions effectively. It is helpful in collecting and analyzing information on markets, new technology, customers, competitors and broad social trends from both internal and external sources of information. BI needs processes, skills and the right data, as well as supportive tools, infrastructure, software applications and technologies, to access and analyze the data and generate actionable knowledge and intelligence.
10.2 FEATURES OF BUSINESS INTELLIGENCE

Since BI directly impacts the strategic, tactical and operational business decisions of organizations, it supports fact-based decision making. Some important features of BI are given below:
Detailed picture of business: BI provides users with detailed, necessary intelligence about the existing business by performing data analysis and creating reports, summaries, dashboards, maps, graphs and charts. Market trends can be identified, along with the business problems that need to be addressed.
Measurement: By analyzing historic business data, BI can create major key performance indicators (KPIs) useful for enhancing the performance of the business. KPIs show in numbers what is most important to the business of the organization; these indicators are directly connected to the objectives of the organization.
Process benchmarking: BI helps organizations identify and set benchmarks for varied business processes in order to improve their performance.

Data visualization: Data visualization is supported by BI to enhance data quality and thereby the quality of decision making.
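As an illustration of the Measurement feature above, here is a minimal Python sketch computing a simple KPI – monthly target attainment – from historical figures (the numbers are illustrative; a BI tool would read them from a data warehouse rather than hard-coded values):

```python
# KPI: monthly revenue as a percentage of target.
monthly_revenue = {"Jan": 120_000, "Feb": 135_000, "Mar": 150_000}
monthly_target = {"Jan": 125_000, "Feb": 130_000, "Mar": 140_000}

for month, revenue in monthly_revenue.items():
    attainment = revenue / monthly_target[month] * 100
    print(f"{month}: {attainment:.1f}% of target")
```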
10.3 BUSINESS INTELLIGENCE PROCESS
Business intelligence includes five important elements – reporting, analysis, data mining, data quality and interpretation, and predictive analysis. Reporting deals with accessing and formatting the data and delivering meaningful information. Through analysis, patterns and relationships in the data can be identified. Data mining is the in-depth analysis of data to extract valuable and useful information. Since business intelligence focuses on generating valuable and meaningful insights from data, it must ensure quality both in the data collected and in the interpretations it makes at the end. Through predictive analysis, business intelligence attempts to predict probabilities and trends for future decision making. Of all these elements, reporting and analysis are the central building blocks of business intelligence.

Generally, the process of business intelligence includes three important phases – reporting, analysis and actionable decisions. In the reporting phase, the data is gathered and organized. In the analysis phase, the data is turned into meaningful information. The third phase deals with making actionable decisions to achieve the strategic goal. The detailed process of business intelligence consists of the following five important phases:
1. Identifying data sources;
2. Extracting, Transforming and Loading (ETL) the right data into a data warehouse;
3. Mining the data through online analytical processing (OLAP);
4. Analyzing the data and reporting the findings; and
5. Decision-making based on the reported findings.
The raw data from the organizational databases is extracted, cleaned and transformed into the data warehouse, so that different queries can be developed to generate relevant answers and ad-hoc reports on any issue, and any other analysis can be conducted. ETL processes transform and load data from different data sources. Figure 10.2 presents these five phases of BI.
Fig.10.2. Business Intelligence process (www.canstockphoto.com)
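A minimal, hypothetical sketch of the ETL phase is shown below using the pandas library; the records, column names and output file are made up, and the CSV file merely stands in for a warehouse table:

```python
# Extract raw transactions, Transform them, and Load the cleaned result.
import pandas as pd

# Extract: raw records as they might arrive from an operational system
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount": ["100", "250", "250", "80"],
    "region": ["north", "South", "South", "NORTH"],
})

# Transform: drop duplicate rows, fix data types, normalize labels
clean = raw.drop_duplicates().copy()
clean["amount"] = clean["amount"].astype(float)
clean["region"] = clean["region"].str.title()

# Load: write to the (stand-in) warehouse table
clean.to_csv("warehouse_sales.csv", index=False)
print(clean)
```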
10.4 FACTORS CONTRIBUTING TO SUCCESSFUL BUSINESS INTELLIGENCE
The following factors contribute to making business intelligence successful in organizations:
Focus on business problems first, then on the relevant data: Before implementing business intelligence, the firm should focus on the important business problems that demand proper, actionable decision making. Once the problem to solve is clear, the data required to address it should be identified.
Strategy and objectives: Business intelligence should focus on the clearly defined business strategy of the organization, with a clear vision of the future, followed by clear short-run and long-run business objectives.
Stakeholder analysis: Business intelligence should consider all the stakeholders related to the business problem to be solved. The stakeholders may include different managerial hierarchy levels in the organization, customers and employees. Clear identification of stakeholders helps in identifying the right data sources from which to collect the data relevant to the business intelligence process.
Focus on KPIs: KPIs should be identified for the BI work designed, and they should be monitored continuously on a regular basis. The company's core KPIs should be understood properly before analyzing the results generated by BI.
Selection of ideal business intelligence tools and technologies: Depending on the information demands, the data sources identified and the decision-making processes involved, proper BI tools have to be selected carefully and adapted effectively. The tools and software selected should be capable of addressing the benchmarked KPIs. BI tools and technologies also include data warehouses, dashboards, ad-hoc reporting, data discovery tools and cloud data services.
BI project team formation: The BI planning and implementation activities form a project to be handled by a suitable team with the required technical and managerial skills. Since BI is used by different divisions and departments in an organization, the project team should be formed across the organization by involving members of the concerned departments. With this, every member of the project team will be aware of the process changes that are going to take place, which contributes to the ultimate success of the project.
Preparedness to go with the latest trends in data science: Artificial intelligence, machine learning, deep learning and cloud analytics have evolved to handle complex tasks of human intelligence. BI should leverage these technologies for real-time data analysis and dashboard reporting.
Collaborative business intelligence: BI software should be flexible enough to combine with collaborative tools like social media and other recent technologies, enabling collaborative decision making among teams and across the whole organization.
Integration with other business applications: BI software should be capable of integrating some or all of its features with other business applications to enhance and extend the reporting functionality.
10.5 BUSINESS INTELLIGENCE AND BUSINESS ANALYTICS
Business intelligence deals with what happened in the past and how it happened, leading up to the present moment. It does not address questions of why, nor does it look at predicting the future. Business analytics (BA), on the other hand, deals with the whys of what happened in the past and makes predictions of what will happen in the future if the trend continues. For example, in the case of a sales firm, BI helps to identify which products were most successful in the past and which seasonal trends contributed to the success of past launches. In the same case, BA can help examine the reasons why customers bought the past successful products.
Both business intelligence and business analytics are data management solutions implemented in firms to collect and analyze historical and current data and generate insights for better future decisions. Both contribute to organizational learning and to adjustments that improve operational efficiency and strengthen organizational intelligence. They are helpful in generating insights and knowledge from both internal and external sources of knowledge, and they help both individuals and the organization increase their capability to receive, store, analyze and transfer information with few errors and gaps.
BI analyzes past and present data to serve current business needs and run current business operations. Business analytics analyzes past data to drive the current business; it can change business operations and improve productivity. BI focuses on current business operations, whereas BA focuses on future business operations. BA can take the information from BI as input and process it in a more sophisticated way to visualize the analyzed data.
Activity 1

List the BI tools that are open source or free. Which one would you prefer to use, and why?
10.6 BUSINESS INTELLIGENCE AND BIG DATA
With the advent of big data, the accumulation and processing of high-speed, voluminous and varied data posed challenges for BI processing. In view of this, BI has been moving away from traditional descriptive analysis of data for mere reporting purposes towards predictive and prescriptive analysis, so that decisions can be made almost instantaneously. To keep pace with the flow and analysis of big data, BI is pairing with artificial intelligence, machine learning and fast analytics. BI has been evolving, driven by big data, the cloud and advanced analytics. The goal of BI and big data is to help business organizations make better decisions by analyzing large data sets. The two need to be synchronized and used together; even though they are not the same, they share many common goals.
Big data deals with huge amounts of data and is broader in scope, exploring previous unknowns with the ultimate aim of learning by examining the organization's own operational and machine data. Big data can integrate analytics into business operations and can impact business results directly. Once learning has been achieved satisfactorily through big data, BI processes can be used for additional exploration and reporting. Big data can receive data in three different formats – structured, semi-structured and unstructured. Structured and semi-structured data can be stored in data lakes or data warehouses; data warehouses can store only processed data, for a specific purpose and for use by business professionals. Unstructured data is user-generated and flows continuously from social media in the form of tweets, re-tweets, likes, shares and comments. Such real-time online data can be handled by the stream-processing technique of big data. Once the data is captured from different sources and processed, the big data will be ready for analysis using suitable analytical models and data visualization techniques. A part of the processed data generated from the big data process can be taken as input by the BI model to carry out its further processing and analysis. The BI model may speed up its data warehouse (DWH), make it more fault-tolerant, and explore and discover new insights in the received business data for quicker decisions. If the richness of both BI and big data technologies is combined, the view of organizational data can be widened so as to obtain more detailed and complete data for analysis. Figure 10.3 presents how BI can be integrated with big data architecture.
Fig. 10.3. Business intelligence with Big data architecture (https://www.clearpeaks.com/what-can-big-data-do-for-bi/)
Even though big data is related to BI in several ways, it differs from BI in the way it analyzes data. Table 1 lists the differences between BI and big data.
Table 1. Differences between Business Intelligence (BI) and Big Data

1. BI aims to help the business make better decisions and delivers accurate reports by extracting information directly from the data source, whereas the main purpose of big data is to capture, process and analyze data, both structured and unstructured, to improve customer outcomes.

2. BI makes use of operational systems, ERP databases, data warehouses, dashboards, etc., whereas big data solutions take the help of Hadoop, Spark, R Server, Hive, HDFS, etc.

3. BI uses several tools which enable the business to collate, analyze and visualize data for better decisions and good strategic plans (Tableau, Qlik Sense, OLAP, Sisense, data warehouses, digital dashboards, data mining, Microsoft Power BI, Google Analytics, etc.), whereas big data uses tools or frameworks that store large amounts of data and process them to extract insights for business decisions (Hadoop, Spark, Hive, Polybase, Presto, Cassandra, Plotly, Cloudera, Storm, etc.).

4. Six important features of BI are location intelligence, executive dashboards, ‘what-if’ analysis, interactive reports, a metadata layer and ranking reports, whereas big data is described by the distinguishing characteristics of the data to be handled: volume, variety, variability, velocity and veracity.

5. Application fields of BI include social media, healthcare, the gaming industry and the food industry, whereas big data can be applied to the banking sector, entertainment and social media, healthcare, and retail and wholesale.

6. BI helps in finding answers to known business questions, whereas big data helps in finding unknown questions and answers.

7. In BI, all the business data is combined into a central server and analyzed in offline mode, whereas big data can capture and analyze both offline and online data, and is stored on a distributed file system such as HDFS (Hadoop Distributed File System) rather than on a central server.

8. In BI, the information is saved in a data warehouse platform, whereas in big data the data is distributed across worker nodes for easy processing; a distributed file system is much safer and more flexible, and data lakes and data warehouses are helpful in storing big data.

9. BI solutions carry the data to the processing functions, whereas big data solutions take the processing functions to the data sets; because the analysis happens around the data, large amounts of data are easy to handle.

10. BI can handle data sets structured in a relational database, with additional indexes and forms of access to the tables in the warehouse, whereas big data can handle all types of data – structured, semi-structured and unstructured.

11. BI processes historical data sets, whereas big data solutions can process not only historical data but also data coming from real-time data sources.
10.7 ADVANTAGES AND APPLICATIONS OF BUSINESS INTELLIGENCE
BI benefits business organizations in many ways, helping them take proper decisions and actions. Organizations can create reports on different issues with a single click, saving a lot of time and resources. BI helps in making employees more productive at their tasks, thereby boosting productivity. It helps in creating visibility of various processes and operations and in identifying the areas to be improved. BI gives a bird's-eye view of the organization with the help of dashboards and scorecards. It facilitates several features like predictive analysis, computer modeling and benchmarking, and it creates a suitable environment for taking up analytics easily.
The following are the major benefits of BI:
Better business decisions;
Faster and more accurate reporting and analysis;
Improved data quality;
Reduced costs;
Increased revenues;
Improved operational efficiency;
Measurable team performance;
Predictions for the future;
More assertive decisions about the practices to follow; and
Streamlined business processes.
Activity 2

Develop a framework for BI in a manufacturing firm or in a bank issuing loans.
Check your Progress 1

Answer the question.
1. Discuss the difference between BI and BA.
Summary

●● BI consists of a set of processes, architectures and technologies to convert raw business data into meaningful information that helps in making proper decisions and driving profitable business actions. BI helps organizations identify market trends, spot business problems and recognize current business needs. BI comprises the technologies and strategies incorporated by business organizations in order to collect and analyze historical and current data. It can work together with Business Analytics and Big Data by taking a part of the Big Data output, processing it and generating important insights.
Keywords

●● ERP: Enterprise Resource Planning
●● KPI: Key Performance Indicator
●● ETL: Extracting, Transforming and Loading
●● OLAP: Online Analytical Processing
●● BA: Business Analytics
●● DWH: Data Warehouse
Self-Assessment Questions

1. Describe the process of Business Intelligence.
2. Distinguish between Business Intelligence and Big Data.
3. Compare Business Intelligence with Business Analytics.
Suggested Reading

1. “Too Big To Ignore: The Business Case For Big Data” by Phil Simon.
2. “Data Science For Business: What You Need To Know About Data Mining And Data-Analytic Thinking” by Foster Provost & Tom Fawcett.
3. “Performance Dashboards – Measuring, Monitoring, And Managing Your Business” by Wayne Eckerson.
Web Analytics
UNIT
11
Structure:
11.1 Introduction to Web Analytics
11.2 Benefits of Web Analytics
11.3 Process of Web Analytics and Maturity Model of Web Analytics
11.4 Best practices of Web analytics
11.5 Web Analytics Tools
11.6 Types of Web Analytics
Summary
Key Words
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
Objectives

After going through this unit, you will be able to:

• Understand the concept of web analytics
• Understand how to perform web analytics
• Have knowledge about the difference between web analytics and data analytics
11.1 INTRODUCTION TO WEB ANALYTICS

Web analytics measures, collects, analyzes and reports web data for the purposes of understanding and optimizing web usage. It is the process of analyzing the behavior of visitors to a website. It enables a business to attract more visitors, retain existing customers or attract new customers for its goods or services, or increase the amount of money each customer spends. It is the methodological study of online/offline patterns and trends, carried out to analyze the performance of a website and optimize its web usage. It helps in tracking key metrics and analyzing visitors' activity and the flow of traffic through the website. For example, if a customer visits a website dealing with the sale of clothes, the web analytics software immediately logs measurable information about the customer and the customer's computing device. Software engineers and analysts use that data to analyze the customer's habits, wants, likes and dislikes. Web analytics represents this general idea of data collection and the subsequent analysis of such data.
Web analytics is helpful in understanding and optimizing web traffic and usage. Examples of web analytics data include:
Demographics;
Geographic location of visitors;
Where traffic is being referred from;
Click analytics;
Bounce rate; and
Total number of webpage visits.
Currently, most business organizations maintain their own websites to establish a strong online presence, growing their business through increased exposure and better communication with potential and existing customers. Such websites need to be developed with a strategy to capture important information about customers, their preferences and interests, and other valuable details useful for the growth of the business. Web analytics is not merely installing a script on a website and setting up a monthly report to show the number of visitors; it requires analysis, and all the numbers it delivers should serve as a basis for making informed decisions. Effective web analytics leads to pulling high-quality traffic to the website. It also leads to optimization of the online strategy and definition of future goals. Reporting and decision making have evolved from retrospective analytics (last month's report) to goal-driven analytics (campaign-wise metrics) to real-time analytics (serving personalized content).
11.2 BENEFITS OF WEB ANALYTICS

With web analytics, companies can benefit in many ways. They can:
Assess web content problems so that those problems can be rectified;
Have a clear perspective of website trends;
Monitor web traffic and user flow;
Demonstrate goal acquisition;
Figure out potential keywords;
Identify segments for improvement; and
Find out referring sources.
Web analytics compares the conversion rates of traffic from paid advertisements, social media and search engines. It can determine which country or market yields the highest conversion rate, and it can discover the landing pages with the best or worst conversion rates and the highest drop-off.
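As a simple arithmetic illustration of comparing conversion rates by source, here is a minimal Python sketch (the visit and conversion counts are made up):

```python
# Compare conversion rates across traffic sources.
# conversion rate = conversions / visits * 100
traffic = {
    "paid_ads": {"visits": 5_000, "conversions": 150},
    "social_media": {"visits": 8_000, "conversions": 120},
    "search_engine": {"visits": 12_000, "conversions": 480},
}

for source, stats in traffic.items():
    rate = stats["conversions"] / stats["visits"] * 100
    print(f"{source}: {rate:.1f}% conversion rate")
# -> paid_ads 3.0%, social_media 1.5%, search_engine 4.0%
```

In this made-up example, search engine traffic converts best, so it might deserve more investment.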
11.3 PROCESS OF WEB ANALYTICS AND MATURITY MODEL OF WEB ANALYTICS
The web analytics process consists of different stages, the most important of which are: (i) define goals, (ii) build KPIs, (iii) collect data, (iv) analyze data, (v) test alternatives, and (vi) implement. The goals represent the objectives of web analytics and focus on the types of analysis to be performed. The key performance indicators (KPIs) need to be specified in order to make the web analytics effective and meaningful. As per the goals and KPIs defined, the relevant web data needs to be collected and analyzed properly. Web analytics can help you analyze the various key performance indicators that drive the business by monitoring:
Traffic sources, such as which search engines, frequent keywords and referral sites bring you the most traffic;
The number of unique visitors who visit your site and the sessions they make;
The number of page views and the time spent on each page; and
The path visitors followed while browsing.
Collecting web data on the following points will be very helpful to web analytics:
Time of visit;
How you got to the website (also called the referral URL);
What you searched for;
Time on each page;
How many times you visit;
Location of the computer;
Web browser; and
Total length of time on the website.
Appropriate analysis should be conducted with suitable methods, and such methods should be evaluated. The most suitable alternative should be selected and implemented to perform effective analysis of the web data.
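As a minimal, hypothetical sketch of turning such collected data into basic metrics, the following Python snippet computes page views, unique visitors and bounce rate from a made-up visit log (real data would come from a tracking script):

```python
# Compute basic web metrics from a visit log.
sessions = [
    {"visitor": "A", "pages_viewed": 1},
    {"visitor": "A", "pages_viewed": 4},
    {"visitor": "B", "pages_viewed": 1},
    {"visitor": "C", "pages_viewed": 3},
]

page_views = sum(s["pages_viewed"] for s in sessions)
unique_visitors = len({s["visitor"] for s in sessions})
# Bounce rate: share of sessions in which only one page was viewed
bounce_rate = sum(s["pages_viewed"] == 1 for s in sessions) / len(sessions) * 100

print(page_views, unique_visitors, f"{bounce_rate:.0f}%")  # 9 3 50%
```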
The 5-step maturity model of web analytics:
1. Tracking hygiene metrics: A page view results when a visitor visits the website and loads a page on it; reloads and refreshes also increase page views. Page views are very common and well-known metrics. Enrolments and conversions act as outcome-based metrics, are decided based on business requirements, and contribute to deciding strategy. For most websites, organic search is a significant driver of traffic, and ranking is decided by site performance.
2. Understanding user behavior: Experimentation in the form of making hypotheses is a crucial step in building a successful website. A variant of the website is implemented for a short period of time, gets field validation, and the best performer is rolled out subsequently. Identification and analysis of the paths most frequently taken by users gives an insight into their intent and drives a better understanding of users' desires. All of this can lead to the generation of new content and optimization of the existing website's interface. Well-designed dashboards can present metrics in an intuitive, easy-to-understand way for decision makers and so leverage web data.
3. Enabling digital marketing: The key metrics relevant to each digital marketing method – such as email, display advertising and social engagement – can differ. Campaign results can be assessed by defining the metrics that matter to each campaign. The metrics include a breakdown of traffic sources, channel attribution and custom goal-driven dashboards.
4. Embracing one-to-one marketing: Personalization of interactions and personalized communication promote greater engagement and loyalty, and thereby a better return on the investment made in marketing. Technologies like artificial intelligence and predictive analytics are very helpful in serving dynamic personalized content, landing pages and offers. Cross-selling and up-selling opportunities can be enhanced and customized offers can be made.
5. Driving business strategy: Web analytics becomes a major component of marketing analytics that aligns with the overall business strategy. Speed and agility, or response times, are also important components. Advanced web analytics can play a significant role in the areas of product and service innovation, price optimization and portfolio optimization.
11.4 BEST PRACTICES OF WEB ANALYTICS
There are several best practices prescribed to make web analytics a successful tool. They include:

Encouraging a data-driven environment for decision making: Once the relevant data has been collected to answer whether the set goals are met, ways to improve the KPIs should be identified. With the help of experimentation and testing tools, different solutions can be tried so as to find the one that generates the most engagement for the page.
Avoiding traffic-only reports: The mere provision of traffic reports – about visits, page views, top sources or top pages – will not determine the success of web analytics. Large numbers can be misleading.
Always providing insights with the data: Making the data relevant and meaningful, with valid insights that demonstrate the website's areas of success and improvement, contributes to the success of web analytics.
Avoiding snapshot-focused reporting: Pan-session metrics, such as visitors and user-lifetime value, provide a longer-term understanding of people and users, and allow the performance and maturity level of the website to be evaluated.
Clear communication with stakeholders: With consistency in the information provided, the audience and the weaknesses of the existing system can be understood and disclosed to the stakeholders.
11.5 WEB ANALYTICS TOOLS
There are many web analytics tools available in the market; some are free and some are priced reasonably. The following criteria apply when selecting a tool:
The specified metrics can be tracked from the beginning;
There should be an opportunity to obtain more advanced insights in the future;
The analysis should have minimal impact on the page speed of the website;
The vendor should support privacy and data protection; and
The tool should be easy to use and enable the team to make use of the insights.
The free web analysis tools include Google Webmaster Tools, Sitemeter, CrazyEgg, StatCounter and Compete. Google Webmaster Tools is useful for understanding the behavior of visitors and tracking the organization's search engine traffic; it can also provide analysis reports of the traffic coming from the search engine. Sitemeter provides a counter for real-time traffic on the website. CrazyEgg enables the organization to understand the clicks and visits of readers on different parts of the website. StatCounter provides free hit counters and visitor tracking. Compete helps the organization compete with other websites by giving details on unique visitors, site referral sources and search engine keyword terms. Alexa provides reports on the traffic of bloggers and webmasters, including search analytics and demographics of the web traffic. Google Analytics helps in assessing the popularity of the website, whether an initiated promotion is performing well in bringing traffic to the website, and the instant effects of blog posts and tweets.
11.6 TYPES OF WEB ANALYTICS
There are different types of web analytics, including content web analytics, social web analytics, mobile web analytics and conversion web analytics.
Content web analytics is useful for gaining a complete understanding of the type of content that can fetch more visitors, the content that needs critical revision, and the popular web pages. Such content can be developed by knowing which pages get visited often, the length of stay on each page, and which visitors became customers. The success of content web analytics depends on the extent to which the flow of visitors through the website is understood.
Social web analytics focuses on the different social media platforms, because the internet is a great social place. This type of analytics measures how successful campaigns are in the social media world. It keeps track of the number of people sharing web pages and blog posts with their networks, the locations where the sharing takes place, and the social networks that send business to the website.
Mobile web analytics focuses on the surfing activity of users on the web through their mobile phones. It helps in understanding whether customers are engaged and more likely to buy, depending on whether they are present on their phones or on other computer systems. Such information enables the organization to consider changes on the mobile side or in its advertising efforts.
Conversion web analytics helps in identifying the people who are buying, downloading, playing videos or taking other actions, and also in understanding which visitors are making purchases and which are not. All such information is helpful in setting up goals, measuring the extent to which they are met, and revising and adapting them to improve business.
Check your Progress 1

State True or False.
1. The key metrics relevant to digital marketing include a breakdown of traffic sources, channel attribution and custom goal-driven dashboards.
2. It is a best practice of web analytics to be snapshot-focused in reporting.
Summary

●● Web analytics is the measurement, collection, analysis and reporting of web data for the purposes of understanding and optimizing web usage. Appropriate analysis should be conducted with suitable methods, and such methods should be evaluated; the most suitable alternative should be selected and implemented to perform effective analysis of the web data. Mobile web analytics focuses on the surfing activity of users on the web through their mobile phones. All such information is helpful in setting up goals, measuring the extent to which they are met, and revising and adapting them to improve business.
Keywords

●● KPI: Key Performance Indicator
●● Hygiene metrics
●● Data-driven
●● Traffic sources
Self-Assessment Questions

1. Describe the process of web analytics.
2. Compare data analytics with web analytics.
3. Give some industry examples to show the significance of web analytics.
Answers to Check your Progress

State True or False.
1. True
2. False
Suggested Reading

1. Web Analytics: An Hour a Day – Avinash Kaushik
2. Web Analytics Action Hero: Using Analysis to Gain Insight and Optimize Your Business – Brent Dykes
3. Web Analytics Kick Start Guide – Brent Dykes