
Data Analytics

A Comprehensive Beginner’s Guide


To Learn About The Realms Of
Data Analytics From A-Z
Table of Contents
Introduction
Chapter One: Working with Data
Defining Data
Understanding Various Data Types and Structures
Structured Data
Unstructured Data
Chapter Two: Introduction to the World of Big Data
Big Data- Big Value!
The Big Data Chronicle
Where We Are
The Dramatic Popularity of Big Data
The Emergence of Digital Data-Generating Devices
The Internet of Things and Machine Data
Chapter Three: A Snapshot Into The World Of Data Analytics
The History of Data Analytics
Welcome to The World of Data Analytics!
Data Analytics vs. Data Analysis – Any Discrepancy?
Data Analytics vs. Data Science
Business Intelligence vs. Data Analytics
The Business Use of Data Analytics
Data Analytics Tools
Chapter Four: Data Analytics Vs. Business Analytics
Understanding Business Analytics
Hey! Business Analytics is Not Data Analytics
Essential Components of Business Analytics
Use Cases and Implementation of Business Analytics
Predictive Conservation- Using Shell Plc as a Case Study
Predictive Delivery - Using Pitt Ohio as a Case Study
Chapter Five: Gaining Insights Into the Various Types of Data Analytics
Exploring Types of Data Analytics
Descriptive Analytics – What Happened?
Making Use of Descriptive Analytics
Inferential Statistics in Descriptive Analytics
Diagnostic Analytics – How it Happened
Predictive Analytics – What Can Happen?
Why Predictive Analytics is Important
Real-Life Use Cases of Predictive Analytics
Prescriptive Analytics – What Should be Done?
Chapter Six: Exploring Data Analytics Lifecycle
Overview
Phase 1: Discovery
Phase 2: Data Preparation
Phase 3: Model Planning
Phase 4: Model Building
Phase 5: Communicating the Outcomes
Phase 6: Operationalize
Chapter Seven: Wrapping Your Head Around Data Cleaning Processes
What Exactly is Data Cleaning?
The Common Component in Data Cleansing
Detecting Outliers With Uni-Variate and Multi-Variate Analysis
Extreme Values Analysis
Chapter Eight: Unraveling the Role of Math, Probability and Statistical
Modeling in the World of Data Analytics
Understanding Probability and Inferential Statistics
Probability Distributions
Common Attributes of Probability
Calculating and Measuring Correlation
Pearson's R Correlation
The Spearman Rank Correlation
Exploring Regression Methods
Logistic Regression
The Ordinary Least Square Regression Method
The Time Series Analysis
Recognizing Patterns in Time Series
Chapter Nine: Using Machine Learning Algorithm to Extract Meaning
From Your Data
What is Machine Learning?
How it Relates to Our Subject Matter (Data Analytics)
Machine Learning Approaches
Supervised Learning
Unsupervised Learning
Semi-Supervised Learning
Reinforcement Learning
Chapter Ten: Designing Data Visualization That Clearly Describes
Insights
Understanding Data Visualization
Data Storytelling For Corporate Decision-Makers
Data Visualization For Analysts
Building Data Art for Activists
Meeting the Needs of Your Target Audience
Brainstorm, Man!
Step 2: Describe the Intent
Step 3: Use the Most Practical Form of Visualization For Your Task
Picking the Most Suitable Design Style
Creating a Numerical, Reliable Response
Garnering a Strong Emotional Reaction
Adding Context to Your Visualization
Choosing the Best Data Graphic Type For Your Visualization
Standard Chart Graphics
Comparative Graphics
Statistical Plots
Some Popular Data Visualization Tools
Chapter Eleven: Exploring Data Analytic Methods Using R
Understanding the Open Source R Programming Language
R’s Common Vocabulary
• Non-Interactive
R Studio
Understanding R Data Types
Exploring Various R Variables/Objects
Taking a Quick Peep Into Functions and Operations
Understanding Common Statistical and Analytical Packages In R
Exploring Various Packages for Visualization, Graphing, and Mapping in
R
Conclusion
© Copyright 2020 - All rights reserved.
The contents of this book may not be reproduced, duplicated, or transmitted
without direct written permission from the author.
Under no circumstances will any legal responsibility or blame be held against
the publisher for any reparation, damages, or monetary loss due to the
information herein, either directly or indirectly.
Legal Notice:
This book is copyright protected. This is only for personal use. You cannot
amend, distribute, sell, use, quote, or paraphrase any part of the content
within this book without the consent of the author.
Disclaimer Notice:
Please note the information contained within this document is for educational
and entertainment purposes only. Every attempt has been made to provide
accurate, up to date, and reliable information. No warranties of any kind are
expressed or implied. Readers acknowledge that the author is not engaging in
the rendering of legal, financial, medical, or professional advice. The content
of this book has been derived from various sources. Please consult a licensed
professional before attempting any techniques outlined in this book.
By reading this document, the reader agrees that under no circumstances are
the author responsible for any losses, direct or indirect, which are incurred as
a result of the use of information contained within this document, including,
but not limited to, errors, omissions, or inaccuracies.
Introduction
Welcome to the Realms of Data Analytics!
If you agree that the modern environment generates far more data than past decades did, then you will also agree that it is not enough simply to retrieve data accumulated over the years for modern consumption. To keep pace with today's growing world, data must be collected, evaluated, and quickly put to use for individual and business consumption.
Data holds tremendous potential value: groundbreaking perspectives, a broader understanding of challenges, and innumerable possibilities to forecast, and perhaps even shape, the future. Data analytics is the primary means of identifying and harnessing this value.
Data analytics offers strategies for dealing with and learning from data - identifying trends, finding connections, and making sense of incredibly diverse forms of information. This book offers an insight into some of the main methods, strategies, and resources employed in data analytics.
Learning these approaches will help readers become regular contributors to activities related to data analytics. The content of the book is intended to aid various interested parties, such as aspiring data analysts, as well as business-centric professionals seeking to add data analytics expertise to their skill set.
To be frank, until now, the field of data analytics has been occupied by some so-called data analytics gurus who address the subject in a way that is excessively verbose and intimidating. Hey! Basic data analytics is not as complicated or hard to comprehend as most people believe it is.
This book serves as a ‘quick-start’ guide that will walk you through the
massive and diverse fields of data analytics. If you are entirely new to the
world of data analytics, this book is for you. So buddy, why not allow me to
walk you through the realms of data analytics!
Chapter One: Working with Data
As we delve deeper into the digital age, we find ourselves in a world that is profoundly data-rich and data-driven. Nearly every activity performed by an organization today is shaped in one way or another by data and analytics. Most businesses have begun incorporating advanced analytical methods into potential growth areas in order to simplify operations, boost operating margins, make smarter judgments about human resources, and design efficient budgets. The impact of data extends into our daily lives, healthcare, the economy, and more.
Although the book's major focus is on data analytics, understanding the fundamentals and key terms behind it is a must.
The first chapter of this book takes a broad look at data as well as other
essential factors connected to it. Understanding this essential component of
data analytics serves as an ‘entry point’ into the world of data analytics.
Defining Data
The first step towards using data analytics as a useful decision-making support is to fully grasp the workings of data: how it is obtained and stored, and the various types of data and their attributes.
When facts, details, and, more broadly, information exist in a "raw" and "disorganized" form, they are referred to as data. These data are then evaluated so that useful information can be obtained from them.
When we hear the word "data," many of us automatically envisage spreadsheets and tables. Hey! Data is not in any way limited to figures. Much of the data obtained in today's world is as diverse and complex as Facebook likes or the location reported by your mobile device. Data can be numerical, text-based, audio, or visual, and the volume of data that humans now collect is a drastic increase from just a few years ago.
Understanding Various Data Types and Structures
As stated earlier, data can come in various forms, such as structured and unstructured data. These data structures may include financial data, multimedia files, and genetic mappings. In contrast to the conventional data analysis carried out by companies, most of today's data is unstructured or semi-structured, requiring different techniques and tools to be interpreted and analyzed. Distributed computing environments and massively parallel processing (MPP) frameworks that allow multi-threaded data intake and analysis are the preferred approaches to analyzing these large data sets.
With all this in mind, this section takes a good look at the two common data structures (structured and unstructured data) that are encountered during data analytics processes.
Structured Data
For tech nerds and developers, structured data may appear boring as far as data analytics processes go. This data structure adheres to a predefined data model and is therefore easy to analyze. Structured data follows a tabular format - a relationship between rows and columns. Excel files and SQL databases are two prominent examples of structured data. Both consist of structured rows and columns that can be easily ordered and categorized.
Structured data relies on the existence of a data model - a specification for how data can be organized, processed, and interpreted. With the help of a data model, each field is discrete and can be viewed independently or in combination with data from other fields. This makes structured data exceptionally powerful: for example, data from multiple locations in the database can be collated easily.
Structured data is regarded as the most 'traditional' type of data storage. This is because the earliest implementations of DBMS were capable of storing, processing, and accessing structured data. Semi-structured data, by contrast, does not strictly adhere to the data structures associated with relational databases or other data tables, but instead includes tags or specific markers that distinguish semantic components and impose record and field hierarchies within the data. This is also known as a self-describing structure. Examples of semi-structured data include JSON and XML. This third category (between structured and unstructured data) exists precisely because semi-structured data is much easier to interpret than unstructured data. Most big data applications and software can 'interpret' and process either JSON or XML, which, compared with unstructured data, lowers the difficulty of analysis.
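To make the contrast concrete, here is a minimal R sketch (not taken from the book) that builds a small structured table and then parses a semi-structured JSON record; the field names, values, and the use of the jsonlite package are illustrative assumptions.

# Structured data: rows and columns with a fixed, predefined schema, like an Excel sheet or SQL table.
patients <- data.frame(
  id     = c(1, 2, 3),
  gender = c("M", "F", "F"),
  age    = c(34, 29, 47)
)
str(patients)                         # every column has a single, predefined type

# Semi-structured data: a JSON document whose tags describe its own (possibly nested) fields.
library(jsonlite)                     # a common JSON parser for R (assumed to be installed)
record <- '{"id": 4, "gender": "M", "notes": ["allergic to penicillin"], "device": {"phone": "android"}}'
parsed <- fromJSON(record)
parsed$device$phone                   # nested fields are reachable, but no fixed table schema is imposed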
There are various types of data under structured data. Generally speaking, most programming languages group data into three broad categories - numeric, character, and date- or time-based.
Numeric Data: Just as the name implies, numeric data is measurable information. These data are usually collected in number form rather than as natural language statements. Often labeled quantitative data, numerical data is distinguished from other data types by the fact that mathematical calculations can be performed on its values. For instance, numerical data on the number of men and women in a hospital can be collected and then summed to determine the total number of patients. This property is one of the main ways to identify numerical data.
There are two forms of numerical data, namely, discrete data –
commonly used to describe countable objects and continuous data –
used in describing various types of data metrics. The continuous
numerical data type is further partitioned into interval (3rd level of
measurement) and ratio data (4th level of measurement).
Discrete Data: Discrete data is one of the most common types of numerical data. This data type describes items that can be counted. These are variables whose values can be listed, and such lists may be either finite or infinite. Regardless of the nature of the list, discrete data take on countable values, such as one to ten or one to infinity. These groups of numbers are referred to as countably finite and countably infinite groups. A more practical illustration of discrete data is counting the bowls of sand needed to fill a tub versus counting the bowls of sand needed to fill the sea. In essence, the former is finitely countable, while the latter is countably infinite.
Continuous Data: This is another form of numerical data, used to describe measurements. Unlike discrete data, continuous data values are often defined over intervals. For instance, the common five-point student scoring system identifies top-performing students (first class) as those whose cumulative grade point average falls between 4.5 and 5.0. A student with a cumulative grade point average between 3.5 and 4.49 is regarded as second class upper, while those between 2.5 and 3.49 are labeled second class lower. Hence, a student may have a cumulative grade point average such as 3.272, 3.62, or any other value from 0, the lowest, to 5, the highest possible number. In such a scenario, the continuous data can take uncountably many values within a finite range. Continuous data are further divided into ratio and interval. I will elaborate more on them as we proceed.
Aside from the discrete and continuous data types, numerical data are also categorized using different scales of measurement. These scales of measurement are classified into four major types: ratio, ordinal, nominal, and interval. Nominal and ordinal attributes are generally regarded as categorical data types, while interval and ratio attributes are labeled numeric. Data of one attribute type can be transformed into another. For instance, the quality grade of gold {Fair, Good, Extremely Good, Perfect, Sustainable} is ordinal but can be collapsed into a nominal attribute {like Good and Sustainable} with a specific scaling. Likewise, a ratio attribute like age can be transformed into an ordinal attribute like {Newborn, Child, Teen, Adult}; a short R sketch after the four scales below illustrates this kind of conversion. While such transformations may not be useful for every analytical tool, it is essential to have an in-depth knowledge of the attribute form and category in a data set. This ensures that the correct inferential statistics, analytical strategy, and techniques are implemented and correctly defined. In other words, understanding these common data attributes helps you avoid basic data analytics mistakes.
Nominal (first level of measurement): The nominal Scale, also
referred to as the categorical variable scale, is described as the scale
used in the identification of variables in separate categories and does
not require a numerical value or order. This level of measurement is
regarded as the easiest of all four scales of measurement. Mathematical
computations made on such variables would be pointless since there is
no numeric value attached to this measurement scale.
In some cases, this scale of measurement is used for classification purposes - the numbers associated with components of this scale are just labels for classification or grouping. Calculations based on such numbers would be meaningless because they carry no objective meaning.
The scale is mostly used in surveys and questionnaires in which only variable tags or identifiers are meaningful.
For example, a satisfaction survey may ask which mobile phone brand is preferred. The responses may be coded as follows: "Apple" = 1, "Samsung" = 2, "LG" = 3.
In this research case, only the brand names are relevant to the analyst performing market research. There is no need for a particular order with respect to these brands. Furthermore, when working with nominal data, analysts perform their assessment based on the associated labels or tags.
The most commonly used analytical tools for the nominal scale of measurement include the mode, frequency counts, the chi-square test, and cluster analysis.
Ordinal (second level of measurement): The ordinal scale is a scale of measurement used explicitly to represent the ordering of variables, rather than the differences between them. These measurement scales are usually used to reflect non-numerical ideas like frequency, pleasure, joy, level of pain, etc. It is easy to remember the purpose of this scale, since 'ordinal' sounds close to 'order,' which is precisely the intention of this scale.
The ordinal scale preserves descriptive properties along with an inherent order, but it has no meaningful origin, and thus the distances between the variables cannot be measured. Descriptive properties here mean labeling properties, quite similar to those of the nominal scale of measurement. On the contrary, the ordinal scale does give variables a relative position. The scale lacks an origin because of the absence of a 'true zero.' Examples of this scale of measurement include ratings of product quality, level of satisfaction, etc. The statistical tools outlined for the nominal scale of measurement, in addition to other analytical tools like rank correlation, the median, the Kruskal-Wallis test, and other non-parametric tests, can be used in analyzing these types of data.
Interval (3rd level of measurement): The interval scale is a numerical scale in which both the order of the variables and the differences between them are defined. Variables with known, constant, and computable differences are measured using the interval scale. It is important to note the primary function of this scale as well: 'interval' means 'the distance between two values,' and this is precisely what the interval level of measurement aims to capture. The interval scale includes all of the features of the ordinal scale. Such scales are useful because they open doors for statistical analysis of the data received. The mean, median, or mode can be used to measure central tendency on this scale. The main downside to this scale is that there is no fixed starting point or absolute zero value. All tools used in analyzing the ordinal scale of measurement, plus the arithmetic mean, parametric tests (such as t-tests and tests of proportions), correlation, analysis of variance, factor analysis, and regression, can be used in analyzing data sets at this level of measurement. Quick examples of interval-scale measurements are calendar dates and times, and temperature.
Ratio (4th level of measurement): The ratio scale is a scale of measurement that preserves the order of variables and makes the differences between them known, while also carrying information about a true zero value. Because the scale has a meaningful zero point, equal differences between values represent equal differences in the quantity measured, and the ratios between values are themselves meaningful. Inferential and descriptive analysis tools can be readily applied to variables with a true zero. Moreover, since a ratio scale can do everything that the nominal, ordinal, and interval scales can do, values relative to absolute zero can also be calculated. Examples of ratio-scale measurements include weight and height. In market analysis, a ratio scale is used to measure market share, annual sales, the price of an upcoming product, the number of buyers, etc. The analytical tools used for the nominal, ordinal, and interval scales of measurement as outlined above, plus the correlation coefficient and the geometric and harmonic means, can be used in analyzing this data type.
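As a concrete illustration of these four levels, and of the ratio-to-ordinal conversion mentioned earlier, here is a minimal R sketch; it is not from the book, and the variable names, sample values, and age cut-offs are assumptions chosen purely for demonstration.

brand  <- factor(c("Apple", "Samsung", "LG", "Apple"))            # nominal: labels only, no order
rating <- factor(c("Fair", "Good", "Perfect"),
                 levels = c("Fair", "Good", "Perfect"),
                 ordered = TRUE)                                   # ordinal: order, but no measurable distance
temp_c <- c(18.5, 21.0, 23.5)                                      # interval: differences meaningful, no true zero
age    <- c(0.4, 9, 15, 42)                                        # ratio: true zero, ratios meaningful

# Transforming the ratio attribute 'age' into an ordinal attribute, as described in the text:
age_group <- cut(age, breaks = c(0, 1, 13, 18, Inf),
                 labels = c("Newborn", "Child", "Teen", "Adult"),
                 right = FALSE, ordered_result = TRUE)
table(brand)       # frequency counts are appropriate at the nominal level
table(age_group)   # the same counts apply to the new ordinal attribute
mean(age)          # the arithmetic mean only makes sense at the interval and ratio levels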
Character Data: Another broad data category in data analytics is character data. A typical example of this data type is a brand name. Generally, the idea here is to apply some form of numerical encoding to convert this qualitative data type into a numeric one.
Nevertheless, it is important to note that not all character data can be easily converted into numerical data. For instance, while it is quite easy to assign the values 1 and 2 to the male and female gender categories (for data classification), some other descriptive character variables cannot easily be assigned numeric values. For example, a character variable like a customer's address cannot meaningfully take on a numeric value.
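A small R sketch (not from the book, with made-up values) shows this kind of character-to-numeric encoding:

gender <- c("male", "female", "female", "male")
gender_code <- as.integer(factor(gender, levels = c("male", "female")))   # male = 1, female = 2
gender_code
# By contrast, a free-text field such as a customer address has no natural numeric encoding.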
Date/Time Data: This is another type of structured data. It is treated as a separate class of data in some of the most commonly used statistical packages for data analysis. This is because most time data are a mix of both numeric and character data. Just as the name implies, numerical operations are performed on a specific date or time - for example, computing the number of months since the last hospital visit. As simple as this data type may appear, it is not without flaws. An essential drawback is that many analytical tools fail to recognize such values as dates and times, because they are often recorded in varying formats that use separators such as colons and slashes.
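A brief R sketch (not from the book; the dates and the 30.44-day month approximation are illustrative assumptions) shows why the format has to be handled explicitly:

visits <- c("2019-03-14", "12/28/2019")                # the same kind of value in two different formats
d1 <- as.Date(visits[1])                               # the ISO format parses directly
d2 <- as.Date(visits[2], format = "%m/%d/%Y")          # other formats must be described explicitly
months_since <- as.numeric(difftime(Sys.Date(), d2, units = "days")) / 30.44
round(months_since, 1)                                 # approximate number of months since the last visit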
Mixed Data: When structured data contains both numerical and character components, it is known as mixed data. Take a student identification number such as AQ345689, for example. Since it contains both numbers and characters, it is regarded as a mixed data type. When performing analysis with such a data type, the analyst may decide to separate the two components. Note that the decision to separate such variables depends largely on the tool employed by the analyst.
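A minimal R sketch (not from the book) of separating the two components of such an identifier:

student_id <- "AQ345689"
prefix <- gsub("[0-9]", "", student_id)                # keep only the character part: "AQ"
number <- as.numeric(gsub("[^0-9]", "", student_id))   # keep only the numeric part: 345689
prefix
number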
Unstructured Data
Unstructured data are data that have no predetermined data model or are not organized in a predefined way. Usually, unstructured data is text-heavy, but it may also include dates, numbers, and statistics. This leads to irregularities and ambiguities that make unstructured data harder for conventional systems to comprehend than data stored in structured databases. Unstructured data may live in audio files, video files, or NoSQL databases. In recent times, the ability to access and analyze unstructured data has expanded tremendously, with several emerging technologies and software products coming onto the market that can store different forms of unstructured data. For example, MongoDB is designed for storing documents, while Apache Giraph is designed for storing node-to-node relationships (graph data).
Within the context of data analytics, the ability to evaluate unstructured data is particularly pertinent, as a large part of the data generated by most companies is unstructured in nature. Think of images, clips, or documents in PDF form. A vital driving force behind big data's rapid expansion is the capacity to make valuable deductions from unstructured data.
You may be wondering why I had to explain the nitty-gritty of data. The truth is, data is a key functional area in the world of data analytics, and without an in-depth understanding of it, there is a limit to what you can uncover when working with data analytics.
Chapter Two: Introduction to the World of Big Data
Technical terms can be daunting and misleading to the uninitiated, right? Even those involved in the fields of technology, data, and innovation may constantly find themselves tangled up in outdated and often confusing jargon. Take the areas of data analytics and data science, for example. Both concepts have something to do with the popular term "big data," yet they're different!
This chapter discusses this critical term in data analytics and illustrates why most organizations have quickly shifted from the usual conventional data into the big data world.
Big Data- Big Value!
In today's world, there is a limit to the level of technological trends that have
captured both the technological and mass media by surprise than "big data."
Throughout the analyst groups to the editorial pages of journalism's most
respected sources, the world appears to be awash in big data projects,
activities, and analyses. Nevertheless, as with many tech fads, the concept has
some chaos that adds ambiguity, complexity, and skepticism when trying to
explain how the approaches will help the organization. Hence, it is crucial to
start with a big data overview.
Data is produced continuously and at an increasing rate. Phones and tablets, social networks, medical imaging tools - all these, and more, generate new data that must be stored and processed somewhere, for some purpose.
Essentially, having to deal with these enormous flows of data is hard. Still, a much more challenging task is analyzing and evaluating these vast amounts of data, particularly when they do not adhere to conventional notions of data structure, in order to recognize relevant trends and extract valuable information. This data deluge, though daunting, presents an opportunity to change business, politics, science, and daily living.
Three features stand out as describing the common attributes of big data:
Large volume of data: not just thousands or millions of rows, big data often comprises billions of rows and millions of columns.
Complexity of data types and structures: big data encompasses a wide range of new data formats, structures, and sources, including the virtual traces left on the internet and in other information archives.
Velocity of new data generation and growth: big data can be characterized as high-speed data, with massive data acquisition and near-real-time analysis.
While volume is the most commonly discussed attribute of big data, the variety and velocity of data usually provide a more accurate description of it. Big data is often portrayed as having the 3Vs: volume, variety, and velocity. Because of its size, structure, or complex nature, big data cannot be processed accurately without the help of various data analytics processes; standard, conventional data-processing methods can barely do much. Big data problems require innovative tools and technologies to capture and handle the data and to identify business benefits.
The Big Data Chronicle
We have always battled with data storage. Not long ago, we preserved our most memorable moments at a cost of about $1 per picture. We saved our favorite TV shows and music by overwriting outdated recordings, and our machines were always running out of space.
Then, boom! New, inexpensive innovations opened the floodgates of data. We bought digital cameras and attached our devices to routers. We kept more data on inexpensive computers, but we were still constantly sorting and discarding information. We remained frugal with our resources, but the data we generated was still small enough to handle.
The flow of generated data kept getting deeper and heavier. Technology made data generation increasingly simple for everyone. On mobile phones, roll-film cameras gave way to digital video cameras, and we started taking videos that were never watched.
Higher-resolution sensors spread through scientific and commercial devices. More importantly, the internet began to connect international data silos, generating new challenges we were not equipped to deal with.
The death blow came with the advent of community-driven platforms like YouTube and Facebook, which opened the door for anyone with a connected digital device to make almost limitless contributions to the world's data stores. At this point, storage was the major challenge. Just as we rationed our own storage, computer scientists rationed the computing power needed to run their programs. They designed computer code to address science and industry challenges: modeling specific conditions like chemical reactions, forecasting stock market fluctuations, and reducing the cost of complex resource-scheduling conflicts.
Such projects might take weeks or months to finish, and only the most well-funded companies could afford the sophisticated software required to address these tough business puzzles.
From the 1960s through the 1980s, computer scientists raised high expectations for developments in machine learning (ML), a form of artificial intelligence (AI), but on most occasions their initiatives stalled, primarily because of data and infrastructure limitations.
To sum it up, the ability to derive meaning from data was severely hampered by twentieth-century technology.
Where We Are
Some notable innovations took place toward the beginning of the 21st century. A significant example is Google. Google was all about big data, designed to explore the massive amounts of data on the newly minted World Wide Web. Its developers soon discovered ways to make ordinary computers work together like supercomputers and released these findings in a 2003 publication that laid the foundation for a software system called Hadoop. Hadoop became the foundation upon which most of the world's early big data initiatives would be built.
Well, I guess at this point, it is important to take a quick peek at some of the reasons why data has evolved so dramatically and why the subject of 'big data' has become so popular.
The Dramatic Popularity of Big Data
The amount of data we commit to digital storage is experiencing massive growth for the following reasons:
The emergence of digital data-generating devices: pervasive laptops and smartphones, scientific sensors, and the multitude of sensors in the evolving Internet of Things (IoT).
The steadily falling cost of digital storage.
The Emergence of Digital Data-Generating Devices
Technology that produces and collects data has become inexpensive and, in
fact, pervasive in nature. These devices, phones, cameras, motion sensors,
etc. have made their way into the world of the mainstream consumer market,
along with those of researchers, industry, and policymakers. Occasionally we
deliberately create data when we shoot videos or upload them to websites. In
some cases, we accidentally create data, leave a digital trail on a site we're
browsing, or bring gadgets that transmit geospatial details to service and
internet providers. Quite often the data does not apply to us at all, but instead,
serves as a record of computer operation or empirical occurrences.
The following points explore some of the key origins and applications of
contemporary data technology.
User Behavior: Whenever you visit a website, its administrator can see the details of what you requested from the site (details like search terms, keywords selected, and links clicked). The page may also use JavaScript in your browser to monitor how you interact with it: scrolling down, or hovering the cursor over an object. Websites use this information to profile visitors, and a site can keep records of many different kinds of virtual actions. Even when you don't sign in and the website does not know who you are, the observations derived from your actions are still useful. The more detail a website collects about its user base, the better it can automate marketing strategies, conversion strategies, and product mixes.
Phones and tablets generate much heavier digital tracks. An application
downloaded on your mobile phone may have connections to the mobile
sensors, such as the Global Positioning System (GPS). Because most device
users keep their devices close to them, devices keep extremely
comprehensive data records of their holder's position and operation intervals.
More so, because smartphones are usually in regular contact with cell
networks and Wi-Fi routers, external parties can easily see the positions of
the owners.
Also, businesses with brick-and-mortar stores increasingly use mobile signals to monitor shoppers' movements inside their stores.
Several other businesses are making significant efforts to examine these
virtual paths, especially e-commerce companies that want to understand
online customers fully.
In the past, these businesses may have discarded much of the data generated, keeping only core events (e.g., successful sales). Now, however, several businesses retain all the data from each website visit, enabling them to dig for specific details. The volume of this user-journey data typically runs to many gigabytes (GB) a day for small sites and multiple terabytes (TB) a day for bigger websites.
We produce data even when offline, via call conversations, or when we're
driving past surveillance cameras in stores, city streets, airports, or roads.
Intelligence and security agencies depend on these results to carry out their
duty.
Content Creation and Publishing: What do you need to publish your writing? Not so long ago, a printing press and a network of bookshops were required. However, with the rise of technology and the internet, all you need is the ability to create a web page.
Facebook or Twitter account can create content with global coverage
effortlessly. The same applies to movies and videos. Today's technology,
especially the internet, has dramatically changed the direction of publishing
and has enabled a huge spike in content creations.
Self-publishing sites for the public, in particular, Twitter, Facebook, and
YouTube have opened up the gates to data generation. Anyone can
conveniently upload content that is often beneficial to most businesses, and
the emergence of Smartphone devices, especially those able to film and
upload videos, increasingly reduced the barriers. Since almost everybody
currently has a personal computer with a high-resolution camera and
consistent network access, data uploads are massive, and in fact, inevitable.
Even youngsters can conveniently share unlimited text or video with the social domain. Currently, YouTube, among the most popular self-publishing sites, is arguably the single largest consumer of data resources. Based on previously published figures, YouTube is projected to generate roughly 100 petabytes (PB) of new data annually, from several hundred hours of video uploaded every minute. Other streaming sites like Netflix are not left behind either.
The Internet of Things and Machine Data
Machines never tire of producing data, and the number of connected machines keeps increasing at a rapid rate. Consider Cisco's Visual Networking Index (TM), which predicted that worldwide IP traffic would surpass two zettabytes annually by 2020.
Even if there is a cap on the number of cell phones and personal computers in use, we will continue to add networked processors to the machines surrounding us.
This worldwide network of linked sensors and processors is known as the Internet of Things (IoT). It includes smart energy meters in our apartments and sensors in our cars that assist our driving and quite often report to our insurance companies. Such sensors can also track soil, water, and weather conditions, and feed the digital control systems used to monitor and automate factory equipment. The number of such devices amounted to roughly 5 billion in 2015 and was expected to reach somewhere between 20 and 60 billion by 2020.
Chapter Three: A Snapshot Into The World Of
Data Analytics
Data has been the major catchword for years.
The volume of digital data in existence is rising at an increasing rate, roughly doubling every two years, and in the process changing the way we work and make decisions. According to some recently published articles, data is growing fast and may grow even faster as we progress. By 2025, about 5.7 megabytes of new data are projected to be produced every second for human use.
Whether the data collected is generated by a large enterprise or by an individual, each piece of data needs to be evaluated so that it becomes valuable to its end or immediate users. But how do we do that? Yeah, this is exactly where the popular term 'data analytics' kicks in.
But then, what exactly is data analytics? In this chapter, you're going to get a taste of this concept. Here, I will explain what is meant by data analytics, how it differs from data analysis, why it is important, and the tools used in carrying out data analytical processes.
The History of Data Analytics
Data has always been a significant part of our everyday lives; through
technology advances, we have become more adept at capturing and making
sense of the volume of data generated daily.
Long before now, policymakers used surveys to gather data on urban development and population growth. Analyzing this data could take years, but the process was accelerated by the development of tabulating machines that could read data from punch cards. In the early 1970s, relational databases were developed to extract information using the Structured Query Language (SQL). Non-relational and NoSQL databases evolved during the 1990s as the web and search engines like Google turned data into effortless, readable responses to search queries. Advances in databases and data storage around this time led to data mining, which involved extracting knowledge from massive, often unstructured data sources.
In 1997, researchers at NASA (the National Aeronautics and Space Administration) coined the term "big data" to describe the massive data sets produced by supercomputers. Several years later (2005, to be precise), Google Analytics made it easier to derive valuable insights from internet data, including time on site, new versus repeat visitors, customer statistics, and web impressions and clicks.
A year later, Hadoop appeared and soon became one of the earliest tools for analyzing large-scale data.
With the advent of Amazon Redshift and Google BigQuery over the last decade, data analytics has moved to the cloud. And across all sectors, from healthcare to CPG to financial services, business owners are emphasizing the role of data analytics in their business strategy to stay innovative while increasing their market share.
Welcome to The World of Data Analytics!
The term data analytics describes the process of analyzing data sets in order to draw conclusions about the information they contain. Data analytical techniques help you take raw data and discover the trends needed to form reasonable insights about a given data set.
Simply put, data analytics comprises the qualitative and quantitative methods and systems used to increase efficiency and market gains. Data is collected and classified to identify and evaluate patterns and strategies, which may differ according to organizational requirements. While many data analysts work with vast and complex data, also known as "big data," others use smaller data resources such as internal data sets and organizational records.
Currently, many analytics strategies, tools, and procedures make use of specialized systems and software that incorporate machine learning methodologies, automation, and other functionality.
Data Analytics vs. Data Analysis – Any Discrepancy?
Data analytics and data analysis are sometimes considered as synonymous
terms but have slightly different definitions.
Essentially, the key distinction between data analytics and analysis is a
function of scale, since data analytics is a wider concept in which data
analysis is just a part. Data analysis offers the possibility of analyzing,
modifying, and organizing a given data set specifically to observe its parts
and extract valuable information. Data Analytics, on the other hand, is an
integrated science or discipline that embraces comprehensive data
management. This involves not only analysis but also data extraction,
organization, management, and all the methods, tools, and strategies utilized
during this process.
The task of data analysts is to capture and analyze data and turn it into valuable knowledge. Analysts help companies make valuable strategic decisions by recognizing general patterns and trends. This skill of identifying, modeling, forecasting, and enhancing productivity has put them in ever-increasing global demand across various sectors and industries.
Data Analytics vs. Data Science
While most people may use the terms interchangeably, data science and data analytics are distinct fields, with a notable difference in scope. Data science is an umbrella term for a group of disciplines that mine large data sets. Data analytics is a more focused variant of this and may even be regarded as part of that more comprehensive operation. Data analytics is committed to producing actionable insights that can be implemented effectively based on current observations.
Another important distinction between these two areas is "exploration." Data
science is not associated with resolving basic queries; instead, it digs into large databases in a somewhat unstructured way to reveal specific findings. Data analytics functions best when it is focused, taking into account
issues that require solutions, while relying on the available data. Data science
provides more in-depth perspectives that concentrate on the questions to be
asked, whereas Big Data Analytics stresses seeking responses to queries.
Most specifically, data science appears to be more interested in asking
questions than seeking detailed answers. The area focuses on pinpointing
possible patterns that are dependent on existing data, as well as generating
alternative ways for data interpretation and data modeling.
The two areas can be seen as two sides of the same coin, and their functions are strongly interlinked. For example, data science provides relevant frameworks and explores large data sets to produce initial findings, potential trends, and new information that may be relevant. This information is valuable in certain areas, especially simulation and the improvement of machine learning and artificial intelligence algorithms, because it can improve the way information is processed and interpreted. Furthermore, data science raises relevant questions that we had not thought to ask before, while delivering little in the way of definitive answers.
By incorporating data analytics procedures into the mix, we can convert the things we think we don't know or understand into actionable observations for practical applications.
When considering both fields, it is best not to dwell on the separation between data science and data analytics. It is better to consider them as parts of one whole procedure that is essential for delving deeper, not just into the knowledge we have, but into how to correctly interpret and evaluate it.
Business Intelligence vs. Data Analytics
Business intelligence deals with sophisticated approaches and techniques that help business owners analyze data and carry out decision-making practices to improve their business. BI plays a vital role in managing business data and results. Data analytics, on the other hand, is used to transform raw or unstructured data into user-friendly, usable information. This refined information can then be used to clean, convert, or model data in support of the decision-making process, to draw inferences, and to perform predictive analysis.
The Business Use of Data Analytics
Data analytics can be effectively utilized for various purposes depending on
the industry. Still, the following points outline some of the most prominent
challenges that most organizations resolve through the help of data analytics.

Sales Projection: Based on sales growth, past results, and projected industry trends, businesses can predict future revenue figures more accurately.
Price Rating: Data analysis lets businesses assess how responsive various consumer segments are to a change in the price of various goods and services.
Theft and Fraud Prevention: Credit card providers have traditionally implemented rules to detect potential fraud. With much more sophisticated data analytics and machine learning methods, it is easier to spot and forecast illegal activity. This also extends to insurers, finance, and other security-sensitive sectors. Security and fraud analytics seek to protect physical, financial, and intellectual assets from misuse by internal and external threats. Effective data analytical tools can deliver high rates of fraud detection and overall organizational safety. In essence, mitigation includes systems that allow businesses to rapidly identify possible fraudulent conduct, predict future activity, and identify and monitor suspects.
Statistical, network, path, and big data techniques that feed fraud-propensity alerting frameworks can ensure prompt responses, driven by real-time threat-detection protocols and automatic alerting and prevention. Data management, coupled with effective and consistent tracking of fraud incidents, supports improved fraud risk-management processes. In addition, integrating and analyzing data across the organization provides a unified and comprehensive view of fraud across business divisions, products, and transactions. Drawing on analytics across multiple data sources yields a much more accurate picture of fraud patterns and projections, greater awareness of possible future methods of operation, and detection of loopholes in fraud audits and investigations.

Marketing Optimization, Profiling, and Timing: Data analytics can easily show whether certain marketing strategies, such as advertising campaigns or social media infographics, contribute to the desired result. Using CRM systems and demographic data, businesses can obtain an all-round view of the consumer and better understand their buying habits, which can lead to tailored recommendations and more targeted engagement.
Proactivity and Optimizing Expectations: Companies are constantly under market pressure not only to grow their customer base but also to meet the needs of their customers, in order to maximize customer satisfaction and build long-lasting client relationships. By sharing their details and entrusting businesses with private information about their needs and expectations, consumers expect businesses to understand them, build meaningful interactions, and provide a satisfying experience across all contact points. As a result, businesses need to collect and consolidate different customer identifiers, including cell phone number, email, and address, into a single customer ID. Consumers use a variety of channels in their dealings with businesses, so both conventional and digital data sources need to be combined to understand consumer behavior. Indeed, consumers expect, and businesses need to provide, relevant real-time experiences.

General Optimization of Customer Experience: Poor business strategy can and will lead to a plethora of unfavorable issues, including a substantial risk of undermining customer service and, ultimately, brand loyalty. Applying data analytics to design, manage, and optimize business operations in the production of goods or services enhances effectiveness in meeting clients' expectations and achieving operational excellence. Analytical approaches can be used to increase operational effectiveness, as well as to adapt the corporate workforce to the most recent market needs and consumer preferences. Optimal use of data analytics will also ensure that quality improvement is undertaken continuously, driven by end-to-end visibility into and analysis of key operating indicators.
For example, in most companies, inventory is the primary item in the current-assets category; too much or too little inventory will affect a business's costs and productivity. Data analytics can help with inventory control by sustaining production, distribution, and/or client service levels at minimal cost. The use of data analytics can also lay out information on both current and projected inventory positions, as well as detailed insight into the level, composition, and location of the stock, while informing the choice of supply approach and the decision-making process. Customers expect a relevant, satisfying experience and the ability to let businesses know where and how they want to be engaged.
Data Analytics Tools
Truth be told, data analytics is nothing new. However, the increasing amount of data and the analytics tools now available mean that you can gain considerably deeper insight into data, faster. The observations that big data and modern technology make possible are more reliable and far more comprehensive. And in addition to using historical data to guide future decisions, current data can now be used to make immediate decisions.
Some of the tools that make modern data analytics so effective include the following:
Machine Learning: Artificial intelligence (AI) is the field concerned with creating and using software capable of sophisticated analytics and complex tasks. Machine learning (ML) is a subset of AI that is essential in data analytics and comprises algorithms and architectures that can, in a sense, learn on their own. ML allows systems to collect and interpret data to forecast results without someone directly programming the rules required to make each judgment. You can test a machine learning algorithm on a smaller data sample, and the learning process continues as more data is collected. I will elaborate more on machine learning as we proceed.
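As a toy illustration of testing a model on a smaller sample first, here is a minimal R sketch (not from the book) that fits a simple predictive model on 20 rows of R's built-in mtcars data set and then applies it to the remaining rows; the choice of variables is an assumption.

set.seed(1)
train_rows <- sample(nrow(mtcars), 20)                          # start with a smaller sample of the data
model <- lm(mpg ~ wt + hp, data = mtcars[train_rows, ])         # fit a basic predictive model
predictions <- predict(model, newdata = mtcars[-train_rows, ])  # apply it to rows the model has not seen
head(round(predictions, 1))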
1. Data Management: It is worth noting that before you analyze any data, you need a working process in place to control the inflow and outflow of data from your systems effectively and to keep your data organized. You also need to ensure that your data is of high quality and that it is stored in a Data Management Platform (DMP), where it will be accessible when required. Setting up a data management system helps ensure that your company has a shared understanding of how to coordinate and manage data. Data management is the practice of collecting, maintaining, and using data safely, efficiently, and cost-effectively. It aims to help individuals, organizations, and connected things optimize their use of data within the boundaries of policy and regulation, so that they can make decisions and take actions that maximize the organization's benefit. With companies increasingly relying on intangible assets to create value, this whole process cannot succeed without a comprehensive data management strategy.
Data management at an enterprise includes a wide variety of activities,
strategies, processes, and procedures.
Data management has a broad scope, addressing issues such as how to:
- Create, access, and update data across a range of data tiers
- Store data across multiple clouds and on premises
- Provide scalability and disaster recovery
- Use data in an expanding range of applications, analytics, and algorithms
- Maintain data privacy and security
- Archive and destroy data in compliance with retention requirements

2. Data Mining: The term refers to the practice of sifting through vast volumes of data to find trends and to uncover connections among data points. It helps you search across large data sets and identify what's important; you can then use the insight derived from these analyses to guide your choices. Modern data mining technology allows you to perform these tasks incredibly fast. Data mining requires efficient data collection and storage, and it uses advanced statistical methods to segment the data and determine the likelihood of future events. Data mining is often called Knowledge Discovery in Databases (KDD).
The following are some core data mining features (a short clustering sketch appears after this list of tools):
- Intelligent pattern forecasts based on the study of trends and behaviors.
- Projections based on likely outcomes.
- Creation of decision-oriented knowledge.
- Emphasis on large data sets and databases for analysis.
- Clustering based on finding groups and classes of facts that were not previously identified or clearly reported.

3. Predictive Analytics: Predictive analytics technology can help you evaluate past data to determine potential outcomes and the probability of each outcome occurring. Usually, these tools rely on mathematical algorithms and machine learning. Better-educated guesses mean that companies can make more informed decisions going forward and position themselves to thrive. Predictive analytics enables them to anticipate the needs and concerns of their clients, accurately predict trends, and stay ahead of their rivals.
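Here is the small clustering sketch referred to under data mining above (not from the book): it applies k-means, one common data mining technique, to R's built-in iris measurements to discover groups that were not labeled in advance; the choice of two columns and three clusters is an assumption.

set.seed(42)
clusters <- kmeans(iris[, c("Petal.Length", "Petal.Width")], centers = 3)   # discover three groups
table(clusters$cluster, iris$Species)   # compare the discovered groups with the known species labels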
Data analytics is one of the key factors that have pushed some of the largest and best organizations forward in today's world. Companies that can turn data into practical insights will inevitably be the leaders in today's fiercely competitive world.
For reference's sake, let's take a quick look at two prominent businesses - Uber and Airbnb. Uber has disrupted the taxi-hailing market, while Airbnb has disrupted the hospitality domain. The secret to Uber's drastic revenue increase lies in the power of the big data it gathers and the smart decision-making it builds on that data with the support of data analytics. Airbnb, on the other hand, has used data analytics tools primarily to provide a higher-quality user experience. Both companies are booming with the help of their robust data analytical approach. Therefore, any company taking full advantage of data analytics will outpace its rivals without a hitch.
Chapter Four: Data Analytics Vs. Business
Analytics
Data analytics is important to the researchers and nerds out there, but in most cases, it matters to everyone else only because of the opportunities it can produce. For most business managers and corporate executives, complex statistical algorithms and analytical approaches appear insignificant unless they can be applied to organizational growth. What they are highly interested in is discovering new ways to maximize business income by increasing revenue growth and reducing shortfalls.
However, while business analytics incorporates most data analytical procedures, the terms 'business analytics' and 'data analytics' should not be used interchangeably.
In this chapter, I present the market-centered analytics concept (business analytics), explain how it differs from data analytics, and show how you can use data-derived business insights to improve your company's bottom line.
Understanding Business Analytics
Business analytics is the practice of gathering, storing, analyzing, and
researching business data, as well as the use of statistical models and iterative
techniques to turn data into valuable business insights. Business analytics
seeks to decide which datasets are valuable and how to exploit them to
resolve issues and improve performance, profitability, and sales.
Business analytics is typically applied as a component of Business
Intelligence (BI) to find actionable data. Business intelligence is usually
descriptive: it reflects the techniques and methods used to collect, organize,
and classify raw data and to report on past or current events. Business
analytics is more prescriptive, concerned with how data can be analyzed,
trends recognized, and models built to explain historical events, establish
forecasts for potential events, and suggest measures to achieve optimal
results.
Business analysts now use advanced technology, statistical modeling, and
mathematical models to devise strategies for technology-driven problems. They
draw on statistics, information technology, computer science, and operations
research, and apply big data, artificial intelligence, deep learning, and
neural networks to classify existing data into micro-segments and trends. This
knowledge can then be leveraged to reliably forecast customer behavior or
market-trend-related events and to suggest actions that will move consumers
towards a valuable outcome.
Hey! Business Analytics is Not Data Analytics
Business analytics, as well as data analytics, includes working with and
manipulating data, extracting knowledge from the data, and using the
information to enhance business efficiency. So, what are the fundamental
differences between the two functions? Data analytics involves combing
through large data sets to identify patterns and trends, testing hypotheses,
and supporting business decisions with data-driven insights. Data analytics
aims to address concerns like, "What is the geographical or economic effect
on consumer preferences?" and "What is the probability of a customer
defecting to a competitor?" Data analytics covers many different methods and
techniques, and although they differ, the field is sometimes referred to as
data analysis, data mining, data processing, or big data analytics.
On the other hand, business analytics centers on the wider business
consequences of data and the decisions that will arise from it, such as
whether a company should create a better product line or prefer one venture
over another. The term business analytics refers to a mix of skills, software,
and technologies that enable businesses to quantify and optimize the efficacy
of key business activities like advertising, technical support, marketing, sales
or IT. However, it is important to note that business analytics employs the use
of various data analytics tools like machine learning, data mining, and many
others.
Essential Components of Business Analytics
To better understand the buzzword of business analytics, the following
outlines explain some key components of business analytics:
1. Collection and Aggregation of Data: Before any analysis can be performed,
data must be compiled for each business need, organized, cleaned, and
filtered. This procedure helps to eliminate redundancy and remove obsolete
data while ensuring that the data collected is readily available for
business-specific analytics. Data can be aggregated from:
Transaction Logs: Records that are part of a large dataset held by an
entity or a licensed third party (bank records, sales records, and
shipping records).
Voluntary Data: Data generated by a paper or digital form that is
exchanged either directly by the user or by an approved third party
(usually personal data).
2. Data Mining: Models can be generated by mining through large quantities of
data in a quest to discover and recognize previously unidentified trends and
patterns. Data mining uses many statistical techniques for this purpose, such
as the following (a minimal code sketch of these three techniques appears
after this list):
Classification: Classification can be employed in cases where parameters like
demographics are defined. These defined parameters can be used to classify
and aggregate data.
Regression: Regression is a method used to estimate continuous numeric values
by extrapolating past trends.
Clustering: Clustering is employed when classification labels are unavailable,
so natural groupings in the data must be discovered instead.
3. Association and Sequence Recognition: In certain situations, consumers
perform identical actions at the same time or undertake actions in a
predictable sequence. This data can reveal patterns like:
Association: For instance, two separate products are sometimes bought
together in the same transaction; this may include buying a shaving stick
alongside shaving cream, or buying pencils and erasers.
Sequencing: A quick example may involve a customer who pays for certain
products and then asks for a receipt.
4. Text Mining: To derive useful relationship metrics, companies may also
gather textual information from social media platforms, blog articles, and
call center scripts. This data can be used to:
- Create new products based on demand
- Boost customer support and experience
- Analyze the efficiency of competitors
5. Predictive Analytics: Organizations can develop, deploy, and manage
predictive scoring models to proactively address events like:
- Customer churn, with precision down to client age group, income
level, account tenure, and promotion effectiveness.
- Equipment failure, especially in expected periods of extreme usage
or under exceptional temperature or humidity conditions.
6. Optimization: Businesses can identify best-case scenarios and next-best
actions by designing and implementing optimization strategies, like:
- Optimal sales pricing that uses demand peaks to scale performance
and sustain a steady revenue flow.
- Stock storage and distribution solutions that improve delivery
times and consumer loyalty without compromising storage room.
7. Data Visualizations: Knowledge and observations extracted from data can be
presented as highly engaging graphics to demonstrate:
- Exploratory data analysis
- Performance modeling
- Statistical forecasts
Visualization features allow businesses to use their data to deduce and drive
new business objectives, increase sales, and enhance customer relationships.
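To make the data mining component above more concrete, here is a minimal
sketch, assuming scikit-learn and NumPy are available, of classification,
regression, and clustering on a small synthetic dataset. The features, labels,
and parameter choices are invented purely for illustration.

```python
# Illustrative sketch of the three data mining techniques on toy data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                    # two numeric features (e.g., age and spend)
labels = (X[:, 0] + X[:, 1] > 0).astype(int)     # hypothetical demographic class
y = 3.0 * X[:, 0] + rng.normal(size=200)         # hypothetical continuous target

# Classification: predict a defined class (e.g., a demographic segment)
clf = DecisionTreeClassifier().fit(X, labels)

# Regression: estimate a continuous numeric value from past observations
reg = LinearRegression().fit(X, y)

# Clustering: discover groupings when no class labels are available
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)

print(clf.score(X, labels), reg.score(X, y), np.bincount(clusters))
```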
Use Cases and Implementation of Business Analytics
In business analytics, success also depends on whether all of a company's
stakeholders fully embrace implementation and execution. Effective examples
of business analytics - and of the subsequent implementation of new
predictive-based initiatives - include:
Predictive Conservation- Using Shell Plc as a Case Study
Shell PLC recently introduced artificial intelligence-driven predictive
maintenance to minimize the time lost to system failures. Artificial
intelligence tools determine when servicing engines, switches, and other
machinery is required. They can independently evaluate data to help guide
drill bits via shale formations. They will eventually be able to detect and warn
facility workers of clients' hazardous behavior, minimizing risks from the
drill site to the petrol pump.
The technologies can predict when and where more than 3,000 varieties of
oil-extraction machine parts could fail, keep Shell notified of the current
location of parts at its facilities globally, and schedule when to purchase
machine parts. Such systems also specify where inventory items should be
positioned, as well as how long parts should be held before being rotated,
replaced, or returned. Shell has since shortened inventory analyses from more
than forty-eight hours to less than forty-five minutes.
Predictive Delivery - Using Pitt Ohio as a Case Study
Pitt Ohio, a freight company worth $700 million, was significantly affected
by Amazon's same-day delivery program, which reset consumer expectations.
Clients became more demanding, calling for up-to-the-minute tracking and
estimated delivery times that were considerably shorter than previously
acceptable windows. The organization turned to data mining to find a way to
improve customer interactions.
A cross-departmental project was implemented internally, spanning market
analysis, sales operations, and IT, using previously untapped data. Historical
data, predictive analytics, and algorithms accounting for freight weight,
driving distance, and several other real-time variables allowed Pitt Ohio to
project delivery times with a success rate of 99 percent. This practice led to
an increase in revenue, which can be attributed to the rate at which the
company retained customers.
Chapter Five: Gaining Insights Into the Various
Types of Data Analytics
You may generate all the data on the planet, but if you don't know how to put
it to proper use, there is no point in sitting on that raw information and
expecting things to get better. Here is a quick remedy - Data Analytics.
Data analytics helps you gain additional insights while allowing you to create
strategic decisions more accurately.
Data Analytics is, in a sense, the watchtower of business processes - the
viewing position from which you can observe waves and spot patterns. The good
thing is that we have already taken a quick peek into the world of data
analytics. However, before we delve deeper into the techniques, processes,
and strategies of data analytics, we must take a quick tour of its various
types.
Exploring Types of Data Analytics
It is no longer news that we live in a world where an increasing amount of
data is generated every second. When such volumes of data are collected, it
is only natural to need resources that help manage this information. Raw data
usually comes in the form of unstructured information, and data analysts use
their skills to extract valuable information from it. Here is where the
different types of analytics come into the equation.
Data-driven observations play a crucial role in enabling companies to
develop new strategies.
Depending on the implementation process and the form of analysis needed,
there are four main types of data analytics. They include the following:
Descriptive Analytics – What Happened?
Descriptive analytics aims to summarize the relevant information and to
present it in a comprehensible and meaningful way. It is the most fundamental
type of data analytics and forms the basis of the other types.
Descriptive analytics is the most common type of analysis used by companies
and, in fact, the oldest. In the corporate world, it offers the information
needed to make potential predictions, analogous to what security and
intelligence agencies do for governments. This kind of analysis is often
referred to as business intelligence. It involves analyzing previously
aggregated data, together with data mining techniques, to evaluate what has
happened so far. It is from these results that future outcomes can be
predicted.
Just like the name implies, descriptive analytics simply describes past
occurrences. We can translate these data into facts and figures that are
human-friendly, and by applying different data mining techniques and
analyzing these figures, we can use them to prepare our future actions.
Descriptive analytics helps analysts gain insights from past events,
regardless of when they occurred (whether a day or a year ago), and to use
these data to anticipate how they may influence future actions.
For example, when you can observe patterns of rising or falling figures and
know the regular number of product sales made every month over the previous
three years, you can predict how these trends will affect future sales
(whether an increase or a drop). With the help of descriptive analytics,
business owners can track essential factors like how much of the company's
market share drops in relation to operating costs, how much they spend, and
their average revenue. All of these help business owners cut costs and make
more money at the end of the day, which, of course, is the perfect success
slogan in every business.
Making Use of Descriptive Analytics
Descriptive analysts generally turn data into an accessible product,
including graphical reports that show, visually and clearly, the type of
patterns a company has witnessed in the past, which allows the company to
predict potential outcomes.
A quick example of descriptive analytics is a table of workers' average
salaries in the USA for a specific year. Different companies can use this
table for a variety of purposes. It allows for an in-depth insight into
American society and the purchasing power of consumers, and it has a wide
range of potential consequences. For example, through such a table, we might
see that a surgeon earns much more money than a police officer. That data may
be useful in a political campaign or in deciding the target market for a
particular product.
You can use measures of central tendency and measures of dispersion to
describe a given data set. Measures of central tendency summarize a data set
with a single typical value. The mean is calculated by adding up all the data
points and dividing by the number of individual units, producing an average
figure that can be utilized in different ways.
Another unit used to calculate the central tendency - which again is
important - is the median. In contrast to the mean, the median simply
considers the middle value in a given data set once all the values are
organized from smallest to largest. For example, the fourth number is the
median in a string of seven. The median can sometimes serve as a more
representative value than the mean, as there may be irregularities at either
end of the spectrum that pull the mean towards a misleading value.
Irregularities, or outliers, are unusually low or large numbers that can make
the mean value of a given dataset unrealistic, so the median is more useful
in situations where a data set contains them.
Measuring dispersion or variation helps us see how a dataset is spread out,
or how it varies around the central or mean value of the data set. The
measures used in calculating dispersion are the range, the standard
deviation, and the variance.
The range is the simplest of all measures of dispersion. It is determined by
subtracting the lowest number in a data set from the largest one. Although
very easy to calculate, this value is also very sensitive to irregularities,
as you may have exceedingly low or large numbers at either end of your data
array.
Variance is a measure of dispersion that depicts how far the data set spreads
from the mean. Variance is generally used to compute the standard deviation
and, in most cases, is of limited value on its own. The variance is
determined by first estimating the mean of the data. Each individual value in
the data set is then subtracted from the mean, each of these differences is
squared to obtain positive values, and the squares are summed. Dividing this
sum by the total number of data points in the set gives the measured
variance.
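Written as a formula (standard statistical notation, not taken from this
book), with n data points x_1, ..., x_n and mean \bar{x}, the population
variance described above is:

\[ \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]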
Standard deviation is by far the most common measure of dispersion because it
represents the average distance of the data points from the mean. Both the
variance and the standard deviation are high in situations where the data is
widely dispersed. When calculating the standard deviation, you first
determine the variance; the standard deviation is simply the square root of
the variance. The standard deviation is expressed in the same unit as the
original data, making it much easier to interpret than the variance.
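As a minimal sketch of these measures, assuming Python's standard library is
available, the snippet below computes the mean, median, range, variance, and
standard deviation of an invented list of monthly sales figures (the numbers
are illustrative only).

```python
# Descriptive measures of central tendency and dispersion on toy data.
import statistics

monthly_sales = [120, 135, 128, 150, 142, 500, 131]   # 500 is an outlier

mean = statistics.mean(monthly_sales)          # pulled upward by the outlier
median = statistics.median(monthly_sales)      # more robust central value
value_range = max(monthly_sales) - min(monthly_sales)
variance = statistics.pvariance(monthly_sales) # population variance
std_dev = statistics.pstdev(monthly_sales)     # square root of the variance

print(mean, median, value_range, variance, std_dev)
```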
These values used in measuring the central tendency and dispersion of data
can be utilized to draw several inferences that can assist with possible
projections generated by predictive analytics.
Inferential Statistics in Descriptive Analytics
Inferential statistics is the part of the analysis that involves making
inferences based on data obtained from descriptive analytics. Such inferences
may be extended to the overall population or to any group larger than your
research group. For example, if we conducted research that measured levels of
depression among youths in a high-pressure situation, we could use the data
obtained from this study to estimate the overall levels of depression among
other adolescents in similar conditions.
More inferences can be made with the information derived, including
potential levels of depression in older or younger groups, by adding extra
data from several other types of research. Although these could be flawed,
they may still be used with some level of consistency theoretically.
Diagnostic Analytics – How it Happened
Diagnostic Analytics is an inquiry aimed at analyzing consequences and
generating the best response to a specific situation. Diagnostic analytics
involves analytics methods like data mining, discovery, and correlation
analysis. This method is an advanced form of data analytics that answers the
question "why." Diagnostic Analytics provides a deeper insight into a given
data set to try and address the causes of events and actions.
These may include the following processes:
1. Anomaly Identification/Detection: An anomaly is anything whose presence in
the data raises questions; specifically, values that do not match the
expected pattern. It can be a peak in activity when you are not expecting
one, or a sudden drop in your social media page's subscription rate.
2. Anomaly Research: Before you take any action, it is important to
understand how or why the sudden change occurred. This step includes
collecting additional data sources and detecting trends in those sources.
3. Finding Causal Relationships: After the conditions that caused the
anomalies have been established, the next step is to connect those dots. This
can be achieved through any of the following procedures (a brief sketch of a
simple anomaly check follows this list):
- Regression analysis
- Probability analysis
- Time series analysis
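The sketch below, using invented daily sign-up counts and pandas, shows the
kind of simple anomaly check that the first step describes; a real diagnostic
analysis would then dig into why the flagged points occurred.

```python
# Flag anomalous days with a simple z-score rule (toy data, illustrative threshold).
import pandas as pd

signups = pd.Series([52, 49, 55, 51, 48, 53, 12, 50, 54, 47])  # day 7 drops sharply
z_scores = (signups - signups.mean()) / signups.std()
anomalies = signups[z_scores.abs() > 2]
print(anomalies)   # points worth investigating for a causal explanation
```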
Diagnostic Analytics is also utilized in human resource management to assess
the productivity and performance of workers or applicants for positions.
Comparative analysis can also be used to identify the best-suited candidate
through selected attributes or by displaying trends and patterns in a particular
pool of talent across several criteria (e.g., competence, certification, tenure,
etc.)
Predictive Analytics – What Can Happen?
As you might have inferred from the description above, predictive analytics
is built to forecast (to some degree) what the future holds and to display a
range of potential outcomes. In business, it is often easier to be proactive
than reactive, and predictive analytics helps companies understand how to
make independent corporate decisions that deliver value.
In simple terms, predictive analysis is nothing more than the process of
gathering information from collected data and using it to forecast behavioral
trends and patterns. With the aid of predictive analytics, you can forecast
unknown variables, not only in the future but also in the present and past.
For instance, predictive analytics can be used to identify the perpetrators
of an offense that has already been committed. It can also be employed to
detect fraud while it is being committed.
In the context of marketing, predictive analytics involves the application of
statistical methods, algorithms, and analytical procedures to both structured
and unstructured data sets to generate predictive models. With the advent of
big data and machine learning, it is easier to determine the possibility of a
given outcome. Predictive analytics uses all accessible consumer data and
past behavior to determine and predict likely consumer behaviors.
Why Predictive Analytics is Important
Applying predictive analytics in business can help reduce risk, as business
insights are based on evidence rather than unverifiable predictions that
depend on intuition.
That being said, predictive analytics, when properly executed, can create a
positive effect on your marketing strategy long before lead generation. As
you transform your leads into paying customers, the data generated from these
new clients will shape the next generation of marketing campaigns.

Generation of Quality Leads: With predictive analytics, advertisers can
easily calculate a customer's willingness to purchase a given product with
precision.
The predictive analytics model will analyze consumer data to make these
predictions, allowing marketing teams to provide top-quality leads and
referrals to sales teams. A business can enhance the quality of the leads it
produces by recognizing and evaluating its high-value buyers. Knowing this
consumer group will also provide important insight into how the business can
acquire more customers and assess those more likely to become paying
customers.

Improved Content Delivery: There is nothing more frustrating than spending a
lot of resources and energy producing content, only to realize that no one
views or reads it, and often the absence of a proper content delivery plan is
the reason.
Predictive analytics solves this issue by identifying the types of content
that most resonate with consumers of certain socioeconomic or cultural
backgrounds, and then automatically delivering similar content to audiences
that show the same social or behavioral trends.

Enhanced Lead Rating: With the application of predictive analytics, lead
rating becomes less a checklist of sales parameters and more a real,
statistics-driven view of your target audience.
When paired with a good optimization tool, predictive analytics procedures
can easily score leads based on historical, behavioral, and cognitive data.
Those scores decide whether the targets are "a quick catch" that sales should
approach immediately, or whether they need more time in a nurturing process
before moving further into the funnel.

Precise Lifetime Value Evaluation: You should know that your client's
lifetime value is the real indicator of marketing ROI.
This figure can be projected with the same predictive analytical techniques
that help you deliver content or rate leads. If you study the overall
lifetime value of existing customers whose history matches that of a new
client, you can easily make a fair estimate of the lifetime value of that new
client.
Real-Life Use Cases of Predictive Analytics

Amazon deploys predictive analytics to recommend suitable products and
services to users based strictly on their previous behavior. According to
research, this predictive marketing has been a major driving force behind
increased sales.
Harley Davidson depends on predictive analytics to reach new customers,
create traffic and leads, and maximize sales. They spot potential high-value
buyers who are willing to initiate transactions. With this, a sales
representative can reach out to those customers and guide them through the
purchase process to find the right motorcycle.
StitchFix is yet another store with a creative sales model: customers take a
style survey, and predictive analytics is used to pair them with clothes they
are likely to like. If the customer does not like the clothes received, they
can return them at no cost.
Prescriptive Analytics – What Should be Done?
Prescriptive analytics draws on predictive analytics by recognizing suggested
(prescribed) actions based on the expected future (predicted) results to help
companies attain their business goals.
Be sure not to confuse prescriptive with predictive analytics: predictive
analytics indicates what might occur in the future, while prescriptive
analytics indicates what should be done about it. This branch of analytics
provides a set of prospects and challenges, as well as solutions to be
explored in different settings.
Prescriptive analytics models are continually "learning", using feedback
loops to evaluate the interactions between behaviors and activities and to
propose an ideal solution. Prescriptive analytics analyzes the key
performance metrics by modeling a solution, to ensure that the result can
meet the appropriate metric objectives before anything is executed.
Technically, prescriptive analysis involves a mixture of business rules and
requirements with a collection of (usually supervised) machine learning
modeling techniques. All of this is used to evaluate many opportunities and
to determine their odds.
After this process, you can utilize predictive analytics again to look for
more results (if necessary).
It is widely used to perform the following activities:

- Process automation;
- Online marketing;
- Budgeting;
- Content scheduling;
- Content optimization;
- Brand supply chain management.
Prescriptive analysis can be utilized in a wide range of sectors. Generally,
it is used to gain additional insight into data and to provide several
avenues to explore when taking action, such as:
- Marketing - for campaign scheduling and adaptation
- Health care - for service automation and management
- E-commerce/Retail - for supply chain control and customer care
- Stock exchanges - for the creation of safety procedures
- Construction - for the evaluation of strategies and resources
Artificial intelligence, machine learning, and neural network algorithms are
often used to facilitate prescriptive analytics by making recommendations
based on complex trends and expectations of organizational priorities,
shortcomings, and variables of influence.
In the broader picture of incorporating data analytics into corporate
performance, prescriptive analytics "offers value to a business through
recommendations" based on data outcomes.
The four types of data analytics discussed above help companies study and
learn from past experiences and results to improve predictions and behavior.
Learning when and how to apply the appropriate type of data analytics to a
business concern will help you arrive at the business solutions you need
while gaining a competitive edge over other businesses in the same industry.
Chapter Six: Exploring Data Analytics Lifecycle
Most issues that seem large and overwhelming at first can be broken down into
manageable pieces or measurable stages that can be dealt with more easily. A
successful analytics lifecycle provides a robust and replicable method of
analysis. It invests time and energy early in the process to gain a clear
understanding of the real-world problem that needs to be addressed.
One common misconception in data analytics projects is that analysts should
jump straight into data analysis without taking adequate time to plan and
scope the level of effort required, to consider the criteria, or to
adequately define the real-world problem that requires a solution. As a
result, analysts may conclude mid-stream that the program managers are
actually trying to accomplish a goal that does not fit the available data
sets, or that they are addressing a purpose that differs from what was
previously conveyed. If this occurs, analysts may need to return to the
initial stage of the process for a thorough discovery phase, or, in some
cases, the project may be canceled.
Developing a well-defined data analytics process helps to demonstrate rigor
and gives the project further legitimacy when the analytical team presents
its insights. A well-defined process also provides a unified framework for
others to follow, so that processes and research can be replicated in the
future or as recruits join the team. That said, this chapter outlines the
data analytics lifecycle that is important to project success.
Overview
The Data Analytics lifecycle is built specifically for data challenges and
data analytics initiatives. The process has six phases, and project work can
take place in several of them simultaneously. For most phases of this
analytical process, movement can be either forward or backward. This
iterative representation is meant to more accurately reflect a real project,
in which elements of the project move forward and then revert to earlier
phases as new knowledge is revealed and members of the team acquire relevant
knowledge about the different phases of the project. This allows participants
to move through the process iteratively and toward completing the project
work.
Data analytics lifecycle may include the following phases:
Phase 1: Discovery
The very first phase of the Data Analytics Lifecycle is the discovery phase.
Here, you must identify and evaluate the issue and establish the historical
background and knowledge about the datasets required and available for the
project. During this phase, decision-makers actively examine market patterns,
similar data analytics case studies, and the industry domain. An evaluation
is made of in-house assets, in-house technology, and infrastructure. Once the
assessment is completed, decision-makers start developing the hypotheses
through which critical business dilemmas will be addressed in light of user
experience and the economy.
This phase may include the following processes:
Step 1 - Understanding the Business Domain: It is necessary to understand
the context of the issue at hand. In several instances, a data analyst may need
extensive theoretical and analytical skills that can be implemented in various
disciplines.
Step 2 - Resource Discovery: As part of the discovery process, an analyst
needs to evaluate the tools required to cover the project. Resources, in this
case, include equipment, software, processes, data, and human resources.
Step 3 - Problem Design: A proper problem design is vital to the survival of
the project. This stage involves identifying the analytical problem that needs
to be addressed. The standard protocol at this stage is to develop a problem
statement and discuss it with relevant stakeholders. Here each analytical team
member can discuss essential factors that relate to the project problem.
In this stage, it is essential to examine the primary goals of the project, to
identify what must be accomplished in business terms as well as to identify
what should be done to meet the demands of the project.
Step 4 – Identifying Decision-Makers: Another crucial step is to recognize
decision-makers and their involvement in the project. During this stage, the
team will define selection criteria, key challenges, and decision-makers.
Here, the team discusses who will benefit from the project and who will have
a significant effect on it.
Step 5 – Setting Up a Set of Hypotheses: Developing a range of initial
Hypotheses is a key element of the discovery phase. This step entails the
creation of strategies that the team can evaluate with the data sets. It's
important to set up a few basic hypotheses to evaluate and then be
imaginative in creating a few more. These initial hypotheses serve as the
basis of the analytical tests that the team will use in subsequent phases.
Part of this procedure includes gathering and analyzing hypotheses from
analysts and industry experts who may have their own opinions about what the
problem is, what the approach should be, and how to arrive at a solution.
These project participants may have good knowledge of the subject area and
can recommend hypotheses to be tested as the team works to establish them
throughout this process. The team should gather several ideas that could shed
light on the working assumptions of the decision-makers.
Step 6 - Finding Potential Data Sources: As part of the discovery phase, it
is important to define the types of data that will be used to address the
problem. Here, you are expected to define the volume, type, and time span of
the data required to test the hypotheses. Make sure the team can access more
than just publicly available data.

Phase 2: Data Preparation
This phase of the Data Analytics lifecycle involves data processing: the
steps required to explore, pre-process, and condition the data. Here, the
project participants, decision-makers, or analysts need to create a stable
environment that enables easy data analysis.
An in-depth knowledge of the data is vital to the success of the project. In
this phase, the project team must also determine how to set up and transform
the data into a format that enables subsequent evaluation and analysis. The
team may create graphical representations to help team members interpret the
data, as well as its patterns, anomalies, and the interactions between data
variables.
Data preparation tends to be the most labor-intensive phase in the data
analytics lifecycle. Each of the steps in the data preparation phase is
discussed in this section. In this stage, you can fill in missing variables,
develop new categories to help classify data that has no proper place, and
delete duplicates from your data. Assigning average values to fields with
incomplete values can allow for proper data processing without skewing the
results.
Step 1 – Preparation of Analytic Workspace: The first sub-phase of data
preparation involves the preparation of an analytical workspace in which the
team can analyze the dataset with no form of interference with live
production databases.
When establishing an analytical workspace, collecting all sorts of data is
your best bet, as project participants may need access to large volumes and
varieties of data for a big data analytics project. This may include anything
from structured and unstructured data sets to raw data streams and
unstructured text from call records or website logs, depending on the type of
analysis the team plans to perform.
This broad approach to collecting data of all sorts differs significantly
from the strategy favored by many IT organizations, which often grant
exposure to only a sub-segment of data for a specific goal. Consequently, the
importance of additional data can never be overemphasized in a data analytics
project, as such projects are often a combination of goal-driven analysis and
innovative methods for evaluating a range of ideas.
While the analytical team may want access to all available data that is
relevant to the project, it may be unrealistic to request access to every
dataset for analytical purposes. Because of these conflicting views on the
accessibility and use of data, it is important for the analytical team to
work with IT, explain what it is trying to achieve, and align on the
objective. During this process, the analytics team needs to give IT a
justification for creating an analytical environment that is separate from
the conventional IT-controlled data warehouses within the enterprise. Meeting
the needs of both the data analytics team and the IT department amicably
requires a close working relationship between the various groups and data
owners, and the payoff is big: the analytical workspace allows companies to
pursue more ambitious data initiatives and to step beyond conventional data
analysis and business intelligence into more rigorous and sophisticated
predictive analysis.
Step 2 – Extract, Transform and Load (ETL): ETL refers to database functions
that are combined into a single tool to extract data from one database and
load it into another that has been optimized for analytical purposes. The
exact steps in this method often vary from one ETL program to another;
however, the outcome is the same.

Extract involves retrieving or accessing data from a database. Here, data is
usually extracted from numerous and varied sources.
Transform is the method of converting the extracted data from its original
form into the form it needs to be in so that it can be stored in another
database. This can be done by using rules or lookup tables, or by combining
the original data source with other data.
Load simply involves uploading the data into its destination database.
Note that before any data is moved, it is important to make sure that the
analytics workspace has enough storage and stable network connections to the
underlying data sources. This precaution helps to avoid interrupted reads and
writes.
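As a minimal, hypothetical sketch of the ETL steps above, assuming pandas and
SQLite are available, the snippet below extracts from an invented source
table, applies two small transformations, and loads the result into an
analytics workspace database. The database files, table, and column names are
all assumptions made for illustration.

```python
# Hypothetical ETL flow: extract from a source DB, transform, load into a workspace DB.
import sqlite3
import pandas as pd

source = sqlite3.connect("source_system.db")        # assumed source database
workspace = sqlite3.connect("analytics_workspace.db")

# Extract: pull the raw records from the source
orders = pd.read_sql("SELECT * FROM orders", source)

# Transform: standardize a date column and derive a new field
orders["order_date"] = pd.to_datetime(orders["order_date"])
orders["revenue"] = orders["quantity"] * orders["unit_price"]

# Load: write the conditioned data into the analytics workspace
orders.to_sql("orders_clean", workspace, if_exists="replace", index=False)
```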
With ETL, users extract, transform, and load: they collect data from a data
store, convert it, and upload it back into a data store. The analytical
workspace approach is different; it favors extracting, loading, and then
transforming (ELT). Here, the data is collected in its raw state and loaded
into a data store, where analysts can choose either to transform the data
into a new state or to keep it in its original raw state. The rationale
behind this approach is that there is real value in storing and including raw
data in the analytical workspace before any transformation or modification
takes place.
For illustration's sake, consider a fraud detection study on the use of
credit cards. Often, anomalies in this kind of data reflect high-risk
activities that may indicate fraudulent use of credit cards. Using Extract,
Transform and Load procedures may inadvertently filter out or clean up these
anomalies before they are loaded into the data store. In that case, the very
data needed to identify fraudulent instances may be cleaned away
accidentally, which can throw a wrench into the project.
Using the ELT method, the team has access to fresh data for analysis, since
the data is loaded into the database before it is transformed. Analysts have
access to the data in its original form, which can help in finding hidden
complexities in the data. This strategy is one of the reasons an analytical
workspace can grow rapidly. The analytical team may want to clean publicly
available data, but it may also want to retain a copy of the original data to
evaluate or search for hidden trends that existed in the data before the
cleaning stage. This method is sometimes described as ETLT, to highlight the
fact that an analytical team can choose to execute ETL in one scenario and
ELT in another.
Based on the size and volume of the data sources, the analytical team may
need to decide how to coordinate the flow of datasets into the workspace;
transferring vast amounts of data in this way is often referred to as Big
ETL. These data movements can be parallelized by software like Hadoop or
MapReduce, which will be discussed in more depth as we continue.
A major part of the ETLT step involves making an inventory of the data and
comparing the data available with the data sets the team needs. Carrying out
this gap analysis establishes which datasets the analytical team can use
right away, and where the team needs to start data collection projects or
obtain access to new datasets that are currently unavailable. A portion of
this sub-phase includes extracting data from available sources and
determining data connections for raw data, online transaction processing
(OLTP) databases, online analytical processing (OLAP) cubes, or other data
feeds.
Step 3 - Performing Data Insights: An important aspect of every data
analytics project is becoming acquainted with the data. Spending time
studying the complexities of the datasets gives data analysts background
knowledge. This background knowledge helps them recognize what represents a
reasonable value and expected behavior, as well as what would be a surprising
finding. Some of the activities in this step may overlap with the initial
investigation of the datasets that occurs during the discovery phase.
Step 4 – Data Conditioning: Data conditioning refers to cleaning data,
standardizing datasets, and performing data transformations. It is a critical
sub-phase within the Data Analytics Lifecycle, since it involves several
complex steps to combine or integrate data sets and otherwise bring data sets
to a state that facilitates analysis in the later phases. Data conditioning
is commonly viewed as a pre-processing stage in data analytics procedures
because it involves many operations on the datasets before models are
developed, which often means that the conditioning phase is conducted by data
owners, IT, or a systems engineer.
Part of this step also involves determining which aspects of particular
datasets will be valuable for further study. Since teams begin forming ideas
at this stage about which data to keep and which data to transform or
discard, it is crucial to involve several members of the team in these
assessments. Leaving the assessments to a single person can force the team to
return to this stage later to retrieve data that was dismissed.
Step 5 - Data Survey and Visualization: After the team has compiled and
gathered some of the datasets required for the study, a valuable step is to
use visualization software to generate a summary of the data. Finding
high-level trends in the data helps analysts recognize data attributes
quickly. A simple illustration is the use of data visualization to analyze
the quality of the data, including whether the data contains many unknown
values or other dirty-data indicators.

Phase 3: Model Planning
In Phase 3, the analytics team selects candidate models to be used for data
classification, clustering, or other data-related discovery, based on the
intent of the project.
It is in this phase that the team returns to the hypotheses formed in the
first phase, when it became acquainted with the data and recognized the
business or domain issues. Some of the activities to be addressed in this
phase include the following:
Step 1 - Data Exploration and Variable Choice: While some data exploration
occurs during the data preparation phase, those tasks concentrate on data
cleanliness and on evaluating the accuracy of the data itself. The aim of
data exploration in this phase is to understand the relationships among
variables in order to inform the choice of variables and methods, and to
develop insight into the problem area.
Just like the earlier phases of the Data Analytics lifecycle, it is essential
to devote attention to this preparatory work so that the subsequent phases
are easier and more efficient. An effective way to go about this is to use
data visualization techniques. Approaching data exploration in this way helps
the team analyze the data and determine the connections between variables at
a reasonable level.
Step 2 – Model Selection: In this sub-phase, the primary objective of the
team is to select an analytical technique, or a shortlist of candidate
techniques, based on the final objective of the project. Let me briefly
explain the term "model" as it relates to the subject matter (data
analytics). A model is simply an abstraction of reality: one observes events
in a real-world situation and seeks to create models that replicate this
behavior with a series of rules and conditions. In machine learning and data
mining, these series of rules and conditions are grouped into a range of
general techniques, like classification, clustering, and association rules.
When evaluating this set of possible models, the team can shuffle through a
list of several feasible models to fit a given problem. Further information
on fitting the right models to specific types of business problems will be
discussed as we progress.
Generally, the tools used in the model planning phase include R, SQL Analysis
Services, and SAS.

Phase 4: Model Building
Here, the analytical team develops data sets for training, testing, and
production purposes. These data sets allow the data analyst to create an
analytical model and train it while preserving some of the data for model
evaluation. During this phase, it is important to make sure that the datasets
reserved for training and testing are sufficiently large for the chosen
modeling and analytical approaches. A simple way to organize these datasets
is to use the training data sets for the preliminary experiments and the test
sets for validating the approach once the initial experimentation and models
have been carried out.
In this phase, an analytic model is built, fitted to the training data, and
scored against the test data. Analysts run models with the help of software
packages like R or SAS on file extracts and small data sets for experimental
purposes, and then determine the suitability of the model and its
performance: for example, whether the model captures most of the data and has
reliable predictive value. At this point, the models can be modified to
improve performance, for example by changing the variables used. Although the
modeling techniques and the rationale needed to build models are quite
complicated, the overall duration of this phase is short compared to the time
spent on data preparation. The common tools employed in this phase include:

Commercial tools like SPSS Modeler, Matlab, SAS Enterprise Miner, Statistica,
and Mathematica.
Open-source tools like R, Octave, Python, SQL, and WEKA.
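As a minimal sketch of the train/test pattern described in this phase,
assuming scikit-learn is available, the snippet below splits a synthetic
dataset, fits a simple model on the training portion, and scores it against
the held-out test portion. The data and model choice are illustrative, not a
prescription.

```python
# Train/test split and scoring on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # fit on training data
print("Test accuracy:", model.score(X_test, y_test))             # score on held-out data
```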

Phase 5: Communicating the Outcomes
Following the execution of the model, the analytical team is expected to
compare the results of the model with the success and failure criteria.
Throughout this phase, the team discusses the best way to communicate
findings and conclusions to the various analytical members, decision-makers,
and stakeholders, taking into consideration caveats, assumptions, and any
shortcomings of the findings. Since the analysis is often shared across an
organization, it is important that the results are adequately communicated
and that the conclusions are presented in a manner suitable for the audience.
Another part of this phase involves determining whether the project goal has
been accomplished. Nobody ever wants to accept defeat or failure; however,
failure in this case should never be seen as a true failure, but rather as
the data failing to accept or reject a defined hypothesis.
In this phase, the analytical team must be sufficiently rigorous with the
data to assess whether it proves or disproves the hypotheses established in
the discovery phase. When a shallow study is carried out, the outcomes are
not rigorous enough to accept or reject a hypothesis. At other times,
analytical teams conduct very rigorous analysis but then search for
opportunities to display results even when the results do not support them.
Hence, it is imperative to find the right balance between these two extremes
when evaluating data, and to be realistic when presenting real-world
outcomes.
Phase 6: Operationalize
Throughout the final phase, the team must convey the benefits of the work
more broadly and set up a pilot project to deploy the work in a controlled
way before expanding it to a complete organization or consumer environment.
In summary, data analytics differs greatly from the conventional statistical
approach to experimental design. Analytics begins the analysis procedure with
data; typically, we structure the data in a way that explains the outcome,
and the goal of this strategy is to predict the outcome or to clarify how the
variables contribute to it. In conventional statistics, by contrast, a study
is first designed as a statistical model, and the data is collected
afterwards; this ensures that the data generated can be used by that
statistical model.
Chapter Seven: Wrapping Your Head Around Data
Cleaning Processes
I wasn't going to write anything on data cleaning until I remembered some
funny experiences I had a few months ago. Over those months, I had tried
analyzing data from devices, polls, and reports, and no matter how many
charts I made or how advanced the algorithms were, the results were still
misleading.
Trust me! Tossing dirty data into a random forest is the same as loading it
with a bug: a bug that has no purpose other than to damage your knowledge, as
if your data were spewing garbage. Even worse, you give the CEO your data
results, and guess what? He or she discovers some mistakes, something doesn't
look right, and your findings somehow fail to match their understanding of
the industry; obviously, they are domain professionals who know more than you
do as an analyst or developer. Perhaps you are summoned, and you have no idea
what just happened. Here is a quick tip on what did happen: you swallowed a
lot of dirty data and didn't bother cleaning it up.
Most analysts swallow dirty data or outliers that end up affecting their
results. It is therefore necessary to become acquainted with data cleansing
and all the other aspects that matter.
What Exactly is Data Cleaning?
Data cleaning, also known as data cleansing, is concerned with finding and
fixing (or, in some cases, eliminating) inaccurate or corrupted records in a
dataset, table, or database. Generally speaking, data cleaning refers to the
detection of missing, incomplete, obsolete, unreliable, or otherwise
objectionable ("unclean") pieces of data, and then the substitution,
alteration, or deletion of that unclean data. With successful data cleaning,
all data sets should be free from errors that could be problematic during the
study.
Data cleaning is commonly believed to be a mundane chore. Yet it is a vital
mechanism that helps companies conserve money and increase their performance.
It is much like getting ready for a long holiday. We may not like the
planning phase, but we can spare ourselves some of the most common nightmares
of the trip by organizing the details in advance. We either do the work up
front, or we pay for it later. It's just that easy!
The Common Component in Data Cleansing
Everyone cleans up records, but no one talks about the components or
processes involved. It is definitely not the flashiest part of data
analytics, and there are no secret tricks to discover.
Even though different types of data may demand different types of cleaning,
the standard steps set out here will still serve as a useful point of
reference.
Now, let's start clearing up that data mess!
Step 1 - Deleting Unnecessary Observations: An essential procedure in data
cleaning is to eliminate unnecessary observations from our dataset. These
unwanted observations may be redundant or irrelevant. Redundant observations
arise most often during data collection, for example when we merge datasets
from different locations or collect data from clients. Such observations
distort the output to a large degree, since the data is repetitive, and may
affect the results negatively. Irrelevant observations are those that simply
do not bear on the actual issue we intend to solve; they add no value to the
dataset and can be excluded directly.
Step 2 - Fixing Structural Errors: The next step in data cleaning is to fix
structural errors in our data collection. These are errors that arise during
measurement, data transfer, or other related circumstances.
They usually involve typographical errors (typos) in feature names, the same
attribute appearing under different names, mislabeled classes (i.e., separate
classes that should be the same), and inconsistent capitalization. Such
structural defects make our model unreliable, which may result in poor
output.
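Here is a minimal sketch, assuming pandas is available, of the first two
steps on a few invented records: inconsistent capitalization is fixed first,
which exposes a redundant row that can then be dropped.

```python
# Fix inconsistent labels, then remove the duplicate row this reveals.
import pandas as pd

df = pd.DataFrame({
    "customer": ["Ada", "Ada", "Bola", "Chen"],
    "region": ["north", "North", "NORTH", "South"],
    "sales": [100, 100, 250, 175],
})

df["region"] = df["region"].str.strip().str.title()   # fix inconsistent capitalization
df = df.drop_duplicates()                              # remove redundant observations
print(df)
```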
Step 3 - Sorting Out Unwanted Outliers: The next step in data cleaning is to
deal with unwanted outliers in our data collection. In most cases, a dataset
may include outliers that lie far away from the majority of the training
data. These outliers can exacerbate problems for certain types of data
analytics models. However, outliers are innocent until proven guilty: we
should have a valid reason before excluding an intruder. Eliminating outliers
sometimes improves model efficiency and sometimes doesn't.
Detecting Outliers With Uni-Variate and Multi-Variate Analysis
Most statistical and machine learning approaches presume that the data is
free of outliers. As stated earlier, outlier elimination is an essential part
of preparing your data for analysis and, in fact, an important aspect of data
cleansing. In this section, you will see several methods you can use to find
outliers in your data.
Extreme Values Analysis
Outliers are data points with values that are substantially different from the
majority of data points that form a variable. Finding and eliminating outliers
is critical because, when they are unaddressed, they distort the variable
distribution, make variance appear unnecessarily large, and cause inter-
variable correlations to be misrepresented. Simply put, outlier identification
is a form of data preparation and an empirical tool of its own.
Outliers may fall into the following categories:
Point: Point outliers are data points whose values are abnormal in comparison
to the usual value range of a feature.
Contextual: Contextual outliers are data points that are abnormal only within
a particular context.
Collective: These outliers appear close to each other, all having similar
values that are abnormal relative to most of the values in the feature.
Sometimes, these outliers can be identified using either a univariate or
multivariate method.
Univariate outlier detection involves studying each attribute in your dataset
and scrutinizing it individually for abnormal values. To do this using
machine learning strategies, two basic methods can be employed:
Tukey outlier marking
Tukey boxplot
Detecting outliers using Tukey outlier marking by hand is a little tedious,
but if you choose to do it, the key is to determine how far the lowest and
highest values lie from the 25th and 75th percentiles. The spread between the
1st quartile (Q1) and the 3rd quartile (Q3) is referred to as the
inter-quartile range (IQR), and it represents the spread of the data. When
trying to determine whether a variable is suspicious for outliers, consider
its distribution, its Q1/Q3 values, and its lowest and highest values. A
standard rule of thumb is X = Q1 - 1.5*IQR and Y = Q3 + 1.5*IQR. When your
lowest value is below X, or your highest value is above Y, the variable has
an outlier.
A Tukey boxplot is relatively easier to work with than manual Tukey outlier
marking. Since each boxplot comes with whiskers that are pegged at 1.5*IQR,
any values that fall beyond those whiskers are simply outliers.
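A minimal sketch of the Tukey rule above, assuming NumPy is available and
using illustrative values:

```python
# Flag values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR as outliers.
import numpy as np

values = np.array([10, 12, 11, 13, 12, 14, 11, 95])   # 95 looks suspicious
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers)    # flags 95 as an outlier
```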
Detecting outliers with multi-variate analysis
Often outliers emerge only across several variables in a dataset. These
outliers can really wreak havoc, so detecting and removing them is critical.
For this purpose, you can use multivariate outlier analysis. A multivariate
outlier detection strategy involves evaluating two or more variables at a
time and testing them together for outliers. You can use various methods to
achieve this (a short sketch of one of them follows this list), including:
Scatter-plot matrix
Boxplot
Density-based spatial clustering of applications with noise (DBSCAN)
Principal component analysis
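As promised above, here is a minimal sketch of multivariate outlier detection
with DBSCAN, assuming scikit-learn and NumPy are available; the eps and
min_samples settings and the synthetic data are illustrative only.

```python
# Use DBSCAN's noise label (-1) to spot points far from the main cloud.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
normal = rng.normal(loc=0, scale=1, size=(100, 2))
odd = np.array([[8.0, 8.0], [-9.0, 7.5]])      # points far from the main cloud
X = np.vstack([normal, odd])

labels = DBSCAN(eps=0.9, min_samples=5).fit_predict(X)
outliers = X[labels == -1]                     # DBSCAN marks noise points as -1
print(len(outliers), "potential multivariate outliers")
```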
Step 4 - Managing Missing Data: One of the trickiest things in data science
is missing data. You cannot simply disregard the missing values in your data
collection; for very pragmatic reasons, you must treat missing data in some
way, as most of the ML algorithms in use do not accept data sets with missing
values. Let's look at the two most widely suggested ways to deal with missing
data.

Dropping observations with missing values: this is a sub-optimal approach,
since dropping observations may mean throwing away vital information. The
missing value itself may be informative, and in the real world we still have
to make predictions on new data even when some of the fields are absent.
Imputing missing values based on historical or other observations: this is
also a sub-optimal approach, because no matter how sophisticated the
imputation method or substitute value is, the imputed value cannot fully
reflect the original one, and this often leads to a loss of information.
Since the missing value can be informative, we should tell our algorithm that
the value was missing. Also, if we impute values, we simply reinforce the
patterns already provided by other features.
In a nutshell, these two approaches to fixing missing data do not work well in most cases. How then can we tackle missing data? Here are a few tips (a minimal pandas sketch of both follows):
To treat missing data in categorical features, simply mark it as 'Missing'. In so doing, you are introducing a new class for the feature, and that new class informs the algorithm that a value was missing.
To manage missing quantitative data, flag and fill the values: add an indicator feature that marks the observation as missing, and then fill the original missing value, commonly by assigning '0'. Together, the flag and the filled-in value allow the algorithm to work with the observation while still being told that a value was missing.
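Here is a minimal sketch of both tips using pandas; the column names (color, mileage) and the tiny DataFrame are purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy dataset with missing values in a categorical and a numeric column
df = pd.DataFrame({
    "color":   ["blue", None, "black", "white", None],
    "mileage": [45.0, 52.5, np.nan, 38.0, np.nan],
})

# Tip 1: for categorical features, treat missingness as its own class
df["color"] = df["color"].fillna("Missing")

# Tip 2: for numeric features, flag missingness, then fill with 0
df["mileage_missing"] = df["mileage"].isna().astype(int)
df["mileage"] = df["mileage"].fillna(0)

print(df)
```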
Chapter Eight: Unraveling the Role of Math,
Probability and Statistical Modeling in the World of
Data Analytics
Trust me! Math and statistics are not the evil creatures that most people perceive them to be. In data analytics, the use of such quantitative methods is simply a part of life and, in fact, nothing to be worried about.
While you may need a working knowledge of mathematics and statistics to solve an analytical problem, you don't need a degree in these fields. Contrary to what some statisticians would have you believe, the world of data analytics is quite different from the world of statistics. Generally, data analysts require specialized experience in one or more areas. They also need statistics, math, and good communication skills to help them identify, grasp, and express the insights that lie inside raw datasets specific to their area of interest. The truth is that statistics is a crucial part of this process.
Within this chapter, I present the fundamental concepts around probability,
correlation analysis, dimensional reduction, regression analysis, and time
series analysis.
Understanding Probability and Inferential Statistics
Probability is one of the most basic aspects of statistics. You ought to be able
to make some basic decisions (like deciding whether you are looking at
descriptive or inferential statistics) to make the data relevant. This can only
be achieved by having a good understanding of the fundamentals of the
probability distribution. Such principles and more are discussed in the
following pages.
A statistic is the product of a statistical procedure on quantitative data. In
general, you use statistics in making decisions about a given data set.
Statistics may come in two different flavors:
- Descriptive: Descriptive statistics, as the keyword implies, offer a description of some of the features of a numerical dataset, such as its range (min and max), its central tendency (such as the mean or median), and its variation and dispersion (e.g., the standard deviation and variance).
- Inferential: Instead of concentrating on the full details of the dataset, inferential statistics take a smaller sample of the dataset and try to infer relevant information about the larger dataset. You can utilize this sort of statistics to gain information about the real-world metric that you are interested in.
It is accurate to say that descriptive statistics explain the qualities of a numerical dataset, but that alone doesn't explain why you should care. In reality, most data analysts are interested in descriptive statistics because of what they reveal about the real-world metrics they describe. For instance, a descriptive statistic is often paired with a measure of precision that indicates how reliable the statistic is as an estimate of the real-world metric.
To fully understand this idea, imagine that the founder of a company would like to estimate the income of the next quarter. The founder may use the average income of the last few quarters as an indicator of how much the company will make in the coming quarter. But if the profits of the prior quarters varied greatly, a descriptive statistic measuring the spread around that expected profit value (the amount by which the estimate might differ from the real income) would show just how far off the expected value might be from the real value.
You could use these descriptive statistics in several cases: for instance, to spot outliers (which I already explained in chapter seven), to plan data pre-processing, or to decide which features you may or may not want to include in an analysis.
Just like descriptive statistics, inferential statistics are employed to reveal something about a real-world variable. Inferential statistics do so by analyzing a small subset of data so that the information derived from this subset can be used to make inferences about the broader dataset from which it was extracted. In statistics, this small data selection is referred to as a sample, while the larger, more complete dataset from which the sample is taken is referred to as a population.
If the dataset is too large to be evaluated in its entirety, take a smaller sample of it, analyze the sample, and afterward draw inferences about the overall dataset based on what the analysis of the sample reveals. You may also employ inferential statistics in cases where you cannot obtain data for the whole population; here, you use the data that you do have to draw inferences about the population at large. Likewise, when no comprehensive data is available for the population of interest, inferential statistics can be used to estimate the missing values based on what you learn from studying the existing data.
To ensure accurate inferences, you ought to carefully pick the sample to
obtain a true reflection of the population.
Probability Distributions
Assume that you recently settled into Las Vegas and probably in your ideal
roulette spot. Whenever the roulette spins off, you realize instinctively that
there is an increased likelihood that the ball can drop into any of the cylinder
slots. The spot where the ball drops into is completely random, and the
possible likelihood, or probability, of the ball dropping into one hole over the
other, is equal. Because the ball can drop into any hole, with the same
probability, there seems to be an even probability distribution. In other
words, the ball has the same likelihood of landing in all of the openings in the
ring. However, the slots on the same roulette are not completely the same—
the wheel has eighteen black slots and twenty red or green slots. As a result
of this structure, there is an 18/38 risk that the ball will fall on a black spot.
You're looking forward to making subsequent bets that the ball will fall on a
black slot. Your net gains, in this case, can be called a random variable,
which is a metric of a characteristic or value aligned with an event, an
individual, or a location (for the real-world scenario) that is unpredictable.
Since this characteristic or value is barely predictable, it also doesn't imply
that you know nothing about it.
Moreover, you use what you know regarding this to make your choices. Let's
see how this works.
When you take the average of your winnings (your random variable) weighted by the probabilities in the probability distribution, the resulting value is known as the expectation, or expected value: a weighted average of all the possible outcomes. (You can also think of the expectation as your best guess if you had to guess.)
Well, enough of the illustrations; a probability distribution is simply the list of all possible outcomes of a random variable together with their probabilities. It is important to take account of every possible outcome whenever you are considering the likelihood (probability) of any given variable. Also, note that only one of these outcomes can occur at a time.
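To make the roulette example concrete, here is a minimal sketch (using the 18/38 black-slot probability from above and an assumed even-money $1 bet) of how the expected value is computed as a probability-weighted average.

```python
# Outcomes of a $1 even-money bet on black (American roulette):
# win +$1 with probability 18/38, lose -$1 with probability 20/38
outcomes = [1.0, -1.0]
probabilities = [18 / 38, 20 / 38]

# Expected value = sum of each outcome weighted by its probability
expected_value = sum(o * p for o, p in zip(outcomes, probabilities))
print(f"Expected net gain per $1 bet: ${expected_value:.4f}")  # about -$0.0526
```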
Common Attributes of Probability
- The probabilities of all possible outcomes or events, when added up, must equal one (1.0).
- The probability of a single event must fall between 0 and 1.0 (inclusive).
A probability distribution falls into one of the following two types:

- Discrete distribution: A probability distribution is discrete if the random variable takes values that can be counted and grouped.
- Continuous distribution: Here, the random variable assigns probabilities to a whole range of values.

To illustrate discrete and continuous distributions, consider two variables from a dataset that describes motorcycles. A color variable should have a discrete distribution, since motorcycles come in just a small variety of colors (say blue, white, or black), and observations can be counted by color grouping. On the other hand, a variable measuring miles per gallon, or "mpg," should have a continuous distribution, since each motorcycle can have its own independent value for "mpg."
Some common probability distribution families include the following (a short sampling sketch follows this list):

- Normal distributions (numeric, continuous): Described visually by a symmetrical bell-shaped curve, these distributions model observations that cluster around the most likely values (at the top of the bell curve); occurrences at both tails are less likely.
- Binomial distributions (numeric, discrete): These distributions model the number of successes that can occur across several trials in which only two results are possible (e.g., the conventional coin-flip scenario). Binary variables, those that take only one of two values, have binomial distributions.
- Categorical (non-numeric) distributions: A categorical distribution covers non-numeric categorical variables or ordinal variables.
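The sketch below draws samples from a normal and a binomial distribution with NumPy to show the difference in the values they produce; the chosen parameters (mean 0, standard deviation 1, ten coin flips) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Continuous: 5 draws from a normal distribution with mean 0, std. dev. 1
normal_samples = rng.normal(loc=0.0, scale=1.0, size=5)
print("Normal samples:", np.round(normal_samples, 2))

# Discrete: 5 draws counting heads in 10 fair coin flips (binomial)
binomial_samples = rng.binomial(n=10, p=0.5, size=5)
print("Binomial samples:", binomial_samples)
```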
Calculating and Measuring Correlation
Many statistical and machine learning approaches presume that the individual features in your data are independent of one another. To determine whether they are truly independent, you need to assess their correlation, the degree to which the variables show interdependence. Throughout this section, I will briefly explain Pearson's correlation and Spearman's rank correlation.
Correlation is summarized by a value known as r. The r-value always falls between -1 and 1. The nearer the r-value is to 1 or -1, the stronger the association between the two variables. If, on the other hand, the two variables have an r-value near 0, there is little or no linear association between them.
Pearson's R Correlation
To discover the relationship that exists between continuous variables in a given dataset, an analyst can employ statistical techniques to quantify the connection. One of the simplest and most widely used correlation analyses for this purpose is the Pearson correlation. This type of correlation analysis is based on the following assumptions:

- The data to be evaluated is normally distributed.
- The variables are numeric and continuous in nature.
- The variables are linearly related.

Since the Pearson correlation comes with so many requirements, you can only use it to establish that a relationship exists between two variables, not to rule such a relationship out.
The Spearman Rank Correlation
Spearman's rank correlation is a common test for evaluating the association that exists between ordinal variables. Spearman's rank correlation works by converting the values of each variable into ranks and then measuring the strength of the correlation between those ranks. The Spearman rank correlation is based on the following assumptions (a short correlation sketch follows this list):

- The variables to be analyzed are ordinal.
- Unlike the Pearson correlation, the variables need not be linearly related.
- The data to be analyzed need not be normally distributed.
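As a hedged illustration of both coefficients, the sketch below computes Pearson's r and Spearman's rank correlation with SciPy; x and y are made-up data with a monotonic but non-linear relationship.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Made-up data: y grows monotonically with x, but not linearly
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = x ** 3 + np.array([0.5, -0.3, 0.2, -0.1, 0.4, -0.2, 0.1, 0.3])

pearson_r, pearson_p = pearsonr(x, y)
spearman_r, spearman_p = spearmanr(x, y)

# Spearman should be (nearly) 1.0 because the relationship is monotonic;
# Pearson will be lower because the relationship is not a straight line.
print(f"Pearson r:  {pearson_r:.3f} (p={pearson_p:.3f})")
print(f"Spearman r: {spearman_r:.3f} (p={spearman_p:.3f})")
```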
Exploring Regression Methods
Regression analysis is another analytical technique adopted from the statistics
field to provide data analysts with a range of methods for defining and
measuring the interaction between variables in a dataset. You can use
regression analysis to evaluate the extent of the correlation among variables
in your records. You may use regression to forecast possible values from
historical values, but be vigilant: Regression approaches presume a cause-
and-effect interaction between variables, but current conditions are often
liable to change. Estimating potential values from historical data can yield
inaccurate results as conditions on which such historical data exist may alter.
In this section, I will be explaining some commonly used regression
techniques (logistic regression, linear regression, and ordinary least square
method) that have been used in the past by most data analysts.
Linear regression is a machine learning approach used to explain and measure the connection between your target variable y (the predicted or dependent variable, in statistics jargon) and the attributes you have selected as predictor or independent variables (typically called X in machine learning). If you use a single variable as your predictor, the analysis is called simple linear regression, and it is as plain as the general algebraic formula y = mx + b. You can also employ linear regression with several predictor variables at once to measure their combined relationship with the target; this is known as multiple linear regression.
However, before you start getting too pumped up about using linear regression for your analytical purposes, make sure you are familiar with its underlying drawbacks.
Here are some common limitations and assumptions of linear regression (a short fitting sketch follows this list):

- Linear regression deals with numerical variables, not categorical variables.
- An incomplete dataset with missing values can cause issues. Hence, before trying to construct a linear regression model, fix the missing values; this prevents analytical errors and faulty inferences.
- If your data has anomalies or outliers, your model will produce misleading results, so check for outliers before proceeding with the analysis.
- Linear regression assumes there is a linear relationship between the dataset attributes and the target variable, so ensure that your dataset truly depicts such a relationship. If there is no linear relationship between the variables of interest, consider using a log transformation to adjust for it.
- The linear regression model assumes that all attributes are independent of one another.
- Estimation errors, commonly known as residuals, are assumed to be normally distributed.
- Don't forget the size of the dataset as well! The simple rule here is that you should have at least twenty observations per predictor feature. This helps ensure that the model produced is accurate.
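As a minimal, hedged sketch of simple linear regression (the y = mx + b form described above), the example below fits a slope and intercept with NumPy's least-squares polyfit on made-up data.

```python
import numpy as np

# Made-up data that roughly follows y = 2x + 1 with a little noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8, 13.1])

# Fit a first-degree polynomial: returns the slope m and intercept b
m, b = np.polyfit(x, y, deg=1)
print(f"Fitted line: y = {m:.2f}x + {b:.2f}")

# Use the fitted line to predict a new value (for illustration only)
x_new = 7.0
print(f"Predicted y at x={x_new}: {m * x_new + b:.2f}")
```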
Logistic Regression
This regression analysis is a machine learning approach that can be used to estimate the class of a categorical target variable based on the defined features. Your target variable should contain values that define the target classes (categorical labels, often encoded numerically). An interesting thing to note about logistic regression is that, aside from estimating the class of each observation in your target variable, it also reports the likelihood or probability of each prediction. While logistic regression is similar in spirit to linear regression, its assumptions are simpler and less restrictive than those of linear regression.
Below are the criteria of logistic regression (a short logistic regression sketch follows this list):

- It does not require a linear relationship between the features and the target variable.
- The residuals (distribution errors) do not have to be normally distributed.
- The predictor features do not have to be normally distributed.

To correctly determine whether logistic regression is the best fit for your analysis, it is also important to consider the following constraints:

- Any missing values within the dataset should either be fixed or eliminated before modeling.
- The target variable must be a binary or ordinal variable.
- The predictor features should be independent of each other.
- To generate a reliable result, logistic regression requires more observations than linear regression. The rule of thumb here is at least 50 observations per predictor attribute, which helps ensure that the result obtained from the analysis is accurate and reliable.
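Here is a minimal, hedged sketch of logistic regression with scikit-learn on a made-up binary target; the single feature (hours studied) and the pass/fail labels are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative data: hours studied vs. pass (1) / fail (0)
hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(hours, passed)

# Predict the class and its probability for a new observation
new_obs = np.array([[2.2]])
print("Predicted class:", model.predict(new_obs)[0])
print("Class probabilities:", model.predict_proba(new_obs)[0])
```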
The Ordinary Least Square Regression Method
Ordinary least squares (OLS) is another statistical approach to data analysis in which a linear regression line is fitted to a dataset. With the ordinary least squares method, you square the vertical distances between the data points and the candidate regression line, sum those squared differences, and then adjust the position of the line until that summed squared distance is as small as possible. In other words, if you intend to create a function that is a close approximation of your data, you can use the ordinary least squares method.
But hey, don't expect the real values to be an exact match to the values estimated. The values projected by the regression are essentially predictions that are only close to the real values.
This statistical technique is particularly useful for fitting a linear regression line to a model specification containing multiple independent variables. This way, the ordinary least squares method can be used to estimate the target variable from a set of data features. Please bear in mind that when using the ordinary least squares regression method with multiple independent variables, those independent variables can be interrelated. When independent variables are highly correlated with one another, the situation is known as multicollinearity.
Multicollinearity tends to hurt the accuracy of the individual independent variables as predictors when they are examined one by one. The good thing is that multicollinearity does not reduce the overall predictive accuracy of the model when the variables are taken together.
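The sketch below is a hedged illustration of fitting an OLS model with two independent variables using statsmodels; the toy arrays and coefficients are made up for the example.

```python
import numpy as np
import statsmodels.api as sm

# Made-up data: target roughly equals 3*x1 + 2*x2 + 5 plus noise
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, size=50)
x2 = rng.uniform(0, 5, size=50)
y = 3 * x1 + 2 * x2 + 5 + rng.normal(0, 1, size=50)

# statsmodels expects an explicit intercept column
X = sm.add_constant(np.column_stack([x1, x2]))
results = sm.OLS(y, X).fit()

# The fitted coefficients should come out close to [5, 3, 2]
print(results.params)
```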
The Time Series Analysis
A time series is a sequence of values of an attribute recorded over a specific period of time. Time series analysis can be used to predict future values of a measurement based on historical results. In other words, you can employ time series techniques when you want to estimate or forecast how the values in your dataset will evolve over time.
Recognizing Patterns in Time Series
Time series display distinctive patterns. A constant time series maintains approximately the same level over time, though it is usually subject to some random error. A trended series, on the other hand, exhibits a steady linear movement, usually in an upward or downward direction.
Regardless of the trend type (constant or trended), time series often demonstrate seasonality: predictable, cyclical variations that recur at the same times each year. As an illustration of a seasonal time series, consider grocery outlets that reliably increase their revenues during the festive season; these businesses generally show higher sales during that period because most families shop more then.
If there is seasonal variation in your series, you should capture it at the appropriate interval, whether quarterly, monthly, or semi-annually, where possible. Time series can also exhibit non-stationary processes: volatile, cyclical behavior that is not seasonal. In such cases, the volatility often stems from economic or industrial conditions.
Because they are unstable, non-stationary processes are hard to predict. If you are faced with such a condition, you must convert the non-stationary data to stationary data before proceeding with the analysis.
Just as multivariate analysis evaluates the relationships among several variables while univariate analysis studies a single variable at a time, univariate time series analysis models the changes in a single variable over time. The Autoregressive Moving Average (ARMA) family is an example of this approach. ARMA is a group of forecasting methods that can be used to estimate future values from historical and current data.
As the name suggests, the ARMA family combines autoregressive techniques, which assume that past observations are strong determinants of future values and regress the series on its own past values to forecast the future, with moving average techniques, which model the level of the series and adjust the forecast when shifts are observed. If you want a very simple model or a model that works on a limited dataset, ARMA is not well suited to your purposes; a quick alternative in this case is to stick with a simple linear regression model instead. To use an ARMA model effectively, you should have at least 50 observations. Simply put, the model is best suited to larger samples.
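Below is a minimal, hedged sketch of fitting an ARMA-type model with statsmodels (via the ARIMA class with the differencing order set to 0); the simulated series and the (1, 0, 1) order are illustrative choices.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Simulate a simple autocorrelated series for illustration
rng = np.random.default_rng(1)
noise = rng.normal(0, 1, size=120)
series = np.zeros(120)
for t in range(1, 120):
    series[t] = 0.7 * series[t - 1] + noise[t]

# ARMA(1, 1) is ARIMA with order (p=1, d=0, q=1)
model = ARIMA(series, order=(1, 0, 1))
fitted = model.fit()

# Forecast the next five values of the series
print(fitted.forecast(steps=5))
```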
Chapter Nine: Using Machine Learning Algorithm
to Extract Meaning From Your Data
Have you ever noticed how technology dominates your life and how it has
succeeded in predicting occurrences? Yeah, this is real!
With the advent of artificial intelligence and machine learning systems, almost every operation we perform is touched by these technologies, something you have likely encountered when using social media platforms or other services that depend on machine learning.
Let's explore this much-hyped term, "Machine Learning," and how it applies to data analytics.
What is Machine Learning?
Machine learning is an aspect of artificial intelligence in which machines learn on their own from past occurrences in order to predict outcomes. Machine learning technologies have become important across all fields. In short, machine learning is a method for transforming data into knowledge.
Well, it is no longer news that there has been an influx of data over the last
50 years. However, this large volume of data generated daily is worthless
unless we study it and identify patterns that are embedded therein. Machine
learning techniques are used to automatically detect useful inherent patterns
inside complex data that we would otherwise have failed to explore. These inherent trends, and the understanding of the problem they provide, can be used to forecast future events and to make all sorts of complicated decisions.
How it Relates to Our Subject Matter (Data Analytics)
The principle of machine learning has been around for some time now. What is new is the ability to apply mathematical and statistical computations to big data automatically and easily, and that is what is driving its growing popularity.
Machine learning has been employed in a range of areas, including Google's self-driving vehicle, web recommendation engines (friend suggestions on Facebook, product suggestions from Amazon), and digital fraud prevention.
With all the buzz surrounding machine learning, many companies are wondering whether they should be applying machine learning somewhere in their business. Well, in the vast majority of cases, the answer is a big fat 'no'!
One of the main advantages of the web is that it allows you to utilize nearly unlimited storage and computing capacity to obtain valuable insights from the data that sensors and devices are gathering. Data analytics and machine learning can serve as useful tools to achieve this. However, there is still some uncertainty about when to bring machine learning into the world of data analytics.
At an advanced level, machine learning takes a vast volume of data and
creates valuable insights to support the company. This may mean enhancing
procedures, cutting costs, providing improved customer service, or creating
new business models.
However, most of these advantages can be extracted from conventional data
analytics in most companies without the incorporation of more sophisticated
machine learning frameworks.
Traditional data analytics is useful for describing data: you may produce reports or models of what has occurred in the past or what is happening now, drawing valuable insights for the company.
Data analytics can help identify and track goals, enable better decision-making, and provide effective means for assessing performance over time. But the data models characteristic of conventional data analytics are often static and of limited use when handling rapidly changing and unstructured data. When it comes to IoT, there is also a need to establish associations between thousands of sensor feeds and the external factors that are constantly accumulating billions of data points.
Whereas conventional data analysis builds a model from historical data and expert advice to establish the links between variables, machine learning begins with the outcome variables (e.g., energy savings) and then automatically searches for the independent variables and their associations.
Generally, machine learning is useful when you know exactly what you want but don't yet know which inputs or variables are relevant for making decisions about your subject matter. In that case, you can give the machine learning algorithm the goal and objective(s) you wish to achieve and let it 'learn' from the data which variables are essential to accomplishing that goal. Google's deployment of machine learning in its data centers is a perfect example of this. Data centers have to stay cool; hence, they require large quantities of energy for their cooling systems to function correctly. This is a big expense for Google, so the aim was to use machine learning to improve the efficiency of its data centers.
With over a hundred variables influencing the cooling system (fans, speeds, and so on), designing a model with a traditional approach would be a huge task. Instead, Google used machine learning and reduced total energy usage by more than ten percent, which translates into millions of dollars in savings for Google over the coming years. Machine learning is also useful for forecasting future outcomes accurately. While data models developed using conventional data analytics are fixed, machine learning algorithms continually evolve over time as more data is collected and absorbed. This means a machine learning algorithm can make a prediction, observe what actually happens, compare reality with its projections, and then adjust itself to become much more effective.
Machine Learning Approaches
Several approaches have been used successfully to perform machine learning, and they are generally grouped into the categories described below. Supervised and unsupervised learning are well-known methods and perhaps the most widely adopted. Semi-supervised and reinforcement learning are newer and more complicated, but they have produced promising results.
The 'No Free Lunch' theorem is a notable idea in machine learning. It states that no single algorithm works best for every problem. Each problem you seek to address has its own peculiarities; thus, there are many different algorithms and solutions for each particular issue, and many kinds of machine learning and AI methods will continue to be developed to best suit specific problems.
Supervised Learning
The purpose of this machine learning approach is to learn the mapping (the rules) between a collection of inputs and outputs.
For instance, the inputs could be weather reports, while the outputs could be the number of seaside tourists. The purpose of supervised learning in such a situation would be to learn the mapping that explains the connection between weather conditions and the number of seaside visitors.
Supervised learning algorithms work efficiently with historical data that has specified values. Marking these specified values in training datasets is referred to as labeling. Individuals can assign tasks to the algorithm by telling
the algorithm the characteristics to look out for as well as the value
judgments that are correct or incorrect. By perceiving a specific label as an
example of successful prediction, the algorithm is instructed to find these
specified attributes in future data. Presently, supervised machine learning is
aggressively utilized for both classification and regression challenges, as
overall target attributes are now accessible in training datasets.
This distinguishes supervised learning as one of the conventional approaches
utilized in solving business problems and making business decisions. For
instance, if you use a binary classification to determine the probability of lead
conversion, you understand which leads have been converted and those that
have not been converted. You may define the target values and thus train the
model. Supervised learning algorithms are often used to recognize objects on
images, identify the emotions behind specific social media messages, and
forecast numeric values such as weather, rates, etc.
The ability to adapt to new inputs and make projections is the crucial generalization aspect of machine learning. In training, we want to maximize generalization so that the supervised model captures the actual, general underlying relationship. If the model is over-trained, we trigger over-fitting to the samples used, and the model will not adapt well to new, previously unseen inputs. A drawback to be mindful of in supervised learning is that the supervision provided can introduce some level of bias into the learning: the model can only replicate what it has been shown, so it is vital to provide accurate, unbiased samples.
Furthermore, supervised learning typically needs a lot of data before it learns well. Obtaining sufficient, accurately labeled data is often the hardest and most costly part of using supervised learning. The output from a supervised machine learning model may be a category from a finite set; if that is the scenario, the decision is about how to classify the data, and the task is regarded as classification. Conversely, the output may be a real-valued number; in that case, the task is regarded as regression.
- Classification: Classification is used to separate related data points into distinct groups, and machine learning is used to define the rules that describe how these data points are distinguished. But how are these seemingly magical rules put in place? There are several ways to discover them; typically, we rely on data and known responses to identify rules that divide data points, often with a line. Linear separation is a central principle in machine learning: the term simply means that a line splits the different groups of data points. Put simply, classification methods try to find the best way to separate data points with such a line. The lines drawn between groups are known as decision boundaries, and the whole region assigned to a class is called the decision surface. The decision surface specifies that when a data point falls within its boundary, a certain class will be allocated. (A short classification sketch follows this list.)
- Regression: Regression is one of the most commonly used types of supervised learning. The major difference between classification and regression is that regression produces a numeric value instead of a class. In other words, it is employed in forecasting number-based concerns, such as equity market values or the likelihood of an occurrence.
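As a hedged sketch of supervised classification, the example below trains a k-nearest-neighbors classifier on a tiny labeled dataset (the two weather features and the 'busy_beach'/'quiet_beach' labels are invented for illustration) and then predicts the class of a new, unseen point.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Invented labeled data: [temperature, humidity] -> beach-day label
X_train = np.array([
    [30, 40], [32, 35], [28, 45],   # hot, dry days
    [18, 80], [16, 85], [20, 75],   # cool, humid days
])
y_train = ["busy_beach", "busy_beach", "busy_beach",
           "quiet_beach", "quiet_beach", "quiet_beach"]

# Learn the mapping between inputs (weather) and outputs (label)
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Predict the label for a new, previously unseen day
print(model.predict([[29, 42]]))  # expected: ['busy_beach']
```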
Unsupervised Learning
Unsupervised learning is designed to organize those datasets that do not have
target values. In such cases, the focus of machine learning is to identify
trends in values and to organize artifacts based on the most recent similarities
or discrepancies. In the field of classification tasks, unsupervised training is
typically done with anomaly detection, clustering algorithms, and generative
functions. Such models are useful in discovering concealed inter-item
connections, resolving segmentation challenges, etc.
For example, financial institutions may use unsupervised learning to group clients into different segments; this helps establish detailed guidelines for interacting with each particular group. Unsupervised learning methods are also often used in ranking and recommendation algorithms that are tailored to providing individualized feedback. Unsupervised learning can be more complicated than supervised learning, as the lack of supervision means that the problem is less well defined: the algorithm has a less focused idea of which patterns to search for.
You could liken this to a real-life situation. For example, if you learned to trade forex with the help of an instructor, you could pick up the whole process fairly easily by relying on the instructor's notes, key buy and sell signals, and charts. But if you were self-taught, you would have to learn and relearn while going through all the winning and losing trades alone. In fact, the time, losses, and effort invested in that learning process might make you believe the procedure is a much harder one.
On the other hand, by learning unsupervised in this laissez-faire style, you begin with a fresh slate, carry little prejudice, and might even find a different, easier way to resolve a problem. This is why unsupervised learning is often regarded as knowledge discovery. It is very effective when performing exploratory data analysis, and it leverages density estimation methods to uncover interesting structures in unlabeled data. The most popular form of this is known as clustering. There are also dimensionality reduction, latent variable models, and anomaly detection. More advanced unsupervised techniques
include neural networks like Auto-encoders and Deep Belief Networks. Still,
we're not going to go through them since this chapter is more of an
introductory chapter.
- Clustering: Unsupervised learning is commonly employed in clustering procedures. Clustering is the process of forming groups of items with similar attributes; it tries to identify distinct subgroups within a dataset. Since this procedure sits within unsupervised learning, we are not bound to any list of labels and are, in fact, free to select the number of clusters to build. This is a mixed blessing, with specific drawbacks and benefits, because choosing a model with the appropriate number of clusters must be carried out through an evidence-based model screening process. (A short clustering and PCA sketch follows this list.)
- Association: In association learning, you seek to unravel the rules that describe your data. Association rules are great for situations where you want to locate items that tend to occur together.
- Anomaly Detection: This is simply the detection of rare or unusual items that differ from most of the data. For instance, your bank can employ this procedure to detect fraudulent transactions on your card. Generally, every individual has a specific spending pattern, and your daily spending habits should fall within a usual range of behaviors and amounts. But if someone tries to steal from you using your card, their actions will be distinct from your usual practice. Anomaly detection deploys unsupervised learning to isolate and identify such unusual events.
- Dimensionality Reduction: The goal of dimensionality reduction is to identify the most valuable attributes, reducing the initial feature set to a smaller, more powerful set that still encodes the important information. For instance, when forecasting the number of visitors to the beach, we may use the temperature, the day, the season, and the number of events planned for that day as inputs. In reality, however, the month may not be relevant for forecasting the visitor numbers. Irrelevant attributes like this can overwhelm machine learning algorithms and render them less effective and accurate. Dimensionality reduction defines and keeps only the most important features. Principal Component Analysis (PCA) is a widely used technique for this.
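As a hedged illustration of the clustering and dimensionality-reduction items above, the sketch below runs k-means and PCA with scikit-learn on a small made-up feature matrix; the number of clusters (2) and components (2) are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Made-up customer features: [monthly_spend, visits, basket_size, returns]
X = np.array([
    [500, 12, 40, 1], [520, 11, 38, 0], [480, 13, 42, 1],
    [90,   2,  8, 0], [110,  3, 10, 1], [100,  2,  9, 0],
])

# Clustering: group customers into two segments without any labels
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("Cluster labels:", labels)

# Dimensionality reduction: compress four features into two components
X_reduced = PCA(n_components=2).fit_transform(X)
print("Reduced shape:", X_reduced.shape)  # (6, 2)
```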
Semi-Supervised Learning
Semi-supervised learning is a combination of the supervised and unsupervised learning methods. In this case, the learning process is not tightly supervised with sample outputs for every specific input, but the algorithm is not left entirely to its own devices either; semi-supervised learning seeks the middle ground.
Because a small amount of labeled data is combined with a much larger unlabeled dataset, the burden of providing enough labeled data is reduced. It also opens up many more problems to be solved with machine learning.
Reinforcement Learning
The final form of machine learning is by far the most complex, yet it is my favorite. It is less popular and more nuanced, but it has achieved amazing results. It does not use labels as such; rather, it uses rewards to learn.
If you're acquainted with psychology, I'm sure you have heard a lot about reinforcement. Reinforcement learning is the most advanced form of machine learning, influenced by game theory and behavioral psychology. Here, the algorithm must make informed decisions based on the data input and is then 'rewarded' or 'penalized' according to how productive those decisions turn out to be. By repeatedly receiving rewards and penalties, the algorithm adjusts its actions and gradually learns to produce better outcomes.
The whole concept of reinforcement learning can be likened to training a puppy: good behavior is rewarded with a treat, while poor behavior is punished. This reward-motivated behavior is the key principle of reinforcement learning, and it is very similar to the way we humans learn and adopt certain behaviors.
All through our lives, we receive positive and negative signals and take lessons from them all the time. The neurons in our brains are among the many channels through which we get these messages: for example, if something good happens, our brains release neurotransmitters that make us happy, and we become more likely to repeat that action. Unlike in supervised learning, we barely need continuous supervision to learn; essentially, occasional feedback signals are enough for us to learn effectively and proactively.
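To make the reward-and-penalty idea concrete, here is a minimal, hedged sketch of a tabular Q-learning update (a classic reinforcement learning rule offered purely as an illustration); the tiny two-state, two-action world, the learning rate, and the discount factor are all invented for the example.

```python
import numpy as np

# Q-table: rows are states, columns are actions; start with no knowledge
q_table = np.zeros((2, 2))
alpha, gamma = 0.5, 0.9   # learning rate and discount factor (illustrative)

def q_update(state, action, reward, next_state):
    """Nudge the value of (state, action) toward reward + discounted future value."""
    best_future = np.max(q_table[next_state])
    q_table[state, action] += alpha * (reward + gamma * best_future - q_table[state, action])

# The agent tries action 1 in state 0, gets a reward of +1, and lands in state 1
q_update(state=0, action=1, reward=1.0, next_state=1)
# The agent tries action 0 in state 0, gets a penalty of -1, and stays in state 0
q_update(state=0, action=0, reward=-1.0, next_state=0)

print(q_table)  # the rewarded action now has a higher value than the penalized one
```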
In conclusion, we all profit in multiple ways from machine learning, which is
nearly a part of our everyday lives. If you've used Google to search the Web
today, you just benefited from machine learning. If you have been using your
credit card lately, you also benefited from machine learning algorithms that
verify user identities and avoid potentially fraudulent activity. More
importantly, if you've come across online stores that make customized
recommendations based on the items that you are looking at, well, that is
machine learning at work.
As stated earlier in this chapter, machine learning is changing the basic rules and principles of most business decision-making processes. Machine
learning, for example, enables business owners and business-centric data
analysts to combine data from a wide range of sources, including social
media platforms and e-commerce platforms, to make accurate predictions
about goods that are likely to be sold in the future and more. These business
owners and stakeholders can then adapt their product creation as well as sales
and marketing plans to suit the ever-evolving market and consumer needs.
It is important to note that machine learning is now far from the niche technology that most people believe it to be. Across a wide variety of sectors, machine learning is working to create deeper data insights, support smarter business decisions, and enhance process effectiveness while delivering better goods and services to the market. The use cases of machine learning in data analytics are nearly unlimited: if you have a large amount of data, machine learning can help you identify and understand the trends within it.
Chapter Ten: Designing Data Visualization That
Clearly Describes Insights
Any generic definition of data analytics states that its objective is to assist the
analyst in deriving meaning and value from raw data. Seeking and extracting
insights from raw data is the core of data analytics. However, these
observations, insights, or conclusions derived from such actions will mean
nothing if you don't know how to communicate your results to others.
Data visualization is an excellent way to visually convey insights, as well as
the value of your observations. Nonetheless, to design visualizations well,
you need to know and fully grasp your target market as well as the key
purpose for which you are designing.
You will need to consider the key types of data graphics that are accessible to
you, as well as the major advantages and disadvantages of each. In this
chapter, I introduce to you the basic principles of data visualization design.
Understanding Data Visualization
Every audience is made up of a specific class of users, each with unique subject-matter and analytical interests; hence, you need to be clear about who you are designing for and the reason behind the visualization. But before we go any further, what exactly is data visualization? Simply put, data visualization is a visual depiction created to convey the context and importance of data and data insights. Since data visualizations are designed for a wide variety of target audiences, each one is unique.
To better understand the whole concept of data visualization, I have
explained the three key forms of data visualizations, as well as how to pick
the one that best fits the needs of your audience.
Data Storytelling For Corporate Decision-Makers
In some cases, data analysts may need to develop data visualizations for a less technical audience, tailored toward assisting the corporate members of that audience in making better-educated business decisions. Generally, the major aim of this form of visualization is to resolve the mystery behind the data for your viewers. Here, the viewer relies on the analyst to make sense of the data behind the visualization and to transform valuable insights into graphical stories that viewers can appreciate.
In data storytelling, the analyst's role is to construct a clutter-free, highly focused visualization so that the audience can easily derive insight without making a great deal of effort. These visualizations are best presented in the form of text and images; however, more expert stakeholders or decision-makers may like to have an adaptive interface that they can use to do some experimentation.
Data Visualization for Analysts
When you're creating a visualization for a group of rational, analytical researchers, you can construct data visualizations that are very open-ended. The objective of this form of visualization is to enable members of the audience to interpret the data visually and reach reasonable conclusions on their own. When using data-showcasing visualizations, the goal is to present contextual details that allow members of the audience to draw their own conclusions. These visualizations should carry more descriptive data and fewer simplistic focal points so that viewers can interpret the data for themselves and then form their own opinions. Such visualizations are best presented as text and images or as engaging, interactive dashboards.
Building Data Art for Activists
While this is far from our main focus, it is also important to have some basic
knowledge of this as well. Data visualization can also be targeted towards
viewers like idealists, advocates, and dreamers.
When creating data visualization for this set of individuals, you may need to make or prove a point! You can generally presume that the typical members of this audience are not excessively analytical. These individuals perceive data visualization as a tool for making a point, and data art is the way to go when designing for them. The main aim of data art is to entertain, provoke, disturb, or do whatever it takes to make a clear, simple, thought-provoking argument. Data art has little or no narrative and provides little space for audiences to draw their own conclusions.
Data analysts have an ethical duty to ensure that data is always correctly
represented.
The data analyst must never misrepresent the details of the data to suit what the viewers want to hear, and data art is no exception! Non-technical viewers don't even know about, let alone see, potential problems. They depend on the data analyst to provide a truthful and reliable representation, thus increasing the degree of ethical obligation that the data analyst must assume.
Meeting the Needs of Your Target Audience
To create realistic data visualization, you must get to understand your target
audience just so that your visualizations are tailored specifically to their
needs. But to make a design choice with your intended market in mind, you
will have to take a few steps to ensure that you fully understand the targeted
users for your data visualization.
To obtain the information you need about your market and your intent, adopt
the following procedures:
1. Brainstorm, man! Visualize a particular member of your target group and make as many informed assumptions as you can about that person's motives. Assign this (fictitious) audience member a name and several other distinguishing features. I often picture a 50-year-old single mother named Cheryl.
2. Establish the intent of the visualization. Narrow the aim of the visualization by determining precisely what action or conclusion you want the members of the audience to take as a result of it.
3. Choose a practical design type. Examine the three main types of data visualization (explained earlier in this chapter) and decide which type can best help you achieve your desired result.
Below is a detailed explanation of the outline above.
Step 1: Brainstorm (about Cheryl)
To brainstorm effectively, take out a blank piece of paper and visualize your hypothetical audience member (Cheryl) so that you can construct a much more realistic and effective data visualization. The answers to the following questions can deepen your understanding of your target audience and inform your design.
Take a snapshot of what Cheryl's normal day looks like: what she does when she wakes up in the morning, what she does during her lunchtime, and what her job is like. Consider how Cheryl will use your design as well.
To establish a clear understanding of who Cheryl is and the best way to address her needs, the following questions can help you build in-depth knowledge of her:
- Where does Cheryl work?
- What's her occupation?
- What professional education or experience does she have, if any?
- How old is she?
- Is she single, engaged, or married?
- Does she have children?
- Where does she live?
- What social, financial, cause-related, or professional concerns are relevant to Cheryl?
- How does she see herself?
- What kinds of challenges and concerns does Cheryl have to contend with on a daily basis?
- How do the data analysis and visualization help solve Cheryl's job challenges or her personal problems? How do they increase her self-esteem?
- How do you intend to deliver the visualization to Cheryl: for instance, via email or in a meeting?
- What does Cheryl intend to achieve with your data visualization?
Let's assume Cheryl is the director of the zoning department in Burlington
County. She's 50 years old and a single mother with two children that are
about to enter college. She is extremely involved in local politics and
potentially wishes to step up her career game to the Local Board of
commissioners. Cheryl derives most of her self-esteem from her work and
her passionate willingness to make competent management choices for her
unit.
Yet Cheryl has managed her team according to her instincts, backed up by a few assorted business documents. She's not very analytical, but she understands just enough to grasp what she's doing.
The issue is that Cheryl does not have the analysis software needed to show all the relevant data she should be considering, and she has neither the time nor the desire to analyze it on her own.
Cheryl is thrilled that you will be having a face-to-face meeting with her next weekend to show her the data visualization options available to help her make information-driven management decisions.
Step 2: Describe the Intent
After you have brainstormed about the potential audience member, you can more easily figure out what it is you want to accomplish with the data visualization. Are you trying to get audience members to feel something about themselves or the world around them? Are you trying to make a point? Are you trying to influence the company's decision-makers to make sound business decisions? Or do you just want to lay all the details out there so that audiences can make sense of the data and deduce from it what they will?
Now let's return to the imaginary Cheryl: what choices or processes do you intend to help her with? You will have to make sense of her data and present it in ways that she can clearly understand, so she can see what's going on inside the workings of her unit. Using your visualization, you can try to guide Cheryl toward making the wisest and most successful management decisions.
Step 3: Use the Most Practical Form of Visualization For Your Task
This can be achieved by choosing from the three key styles of visualization: data storytelling, data showcasing, and data art. When you're visualizing for corporate decision-makers, you may have to employ data storytelling to inform the viewers directly, with attention and detail given to their line of business. If you're building visualizations for a gender-equality group or a political campaign, data art can help you make a massive and powerful statement about your data. Finally, whether you're building for technicians, biologists, or mathematicians, stick to data showcasing so that these analytical types have more than enough room to figure things out on their own.
Back to Cheryl: because she isn't overly analytical, and since she relies on you to help her make exceptional information-driven choices, you have to use data storytelling strategies. Build either a static or an interactive data visualization with a few carefully chosen details. The graphic overlays of the design should tell a straightforward story so that Cheryl doesn't have to dig through loads of nuances to get to the heart of what you intend to convey about her data and her area of business.
Picking the Most Suitable Design Style
Analytical types may insist that the only aim of data visualization is to express facts and figures through graphs and charts, and that no elegance or design is required. More creative-minded people might insist that they have to feel something in order to understand it effectively. Strong data visualization is neither artless and flat nor pure abstract art; instead, its elegance and character lie somewhere in the range between those two extremes.
To select the most suitable visual style, first understand your audience and then determine how you want them to react to your visualization. When you're trying to get the audience to take a deeper, more critical plunge into the visualization, use a design style that induces a calculating, precise response. If you want the data visualization to fuel your audience's enthusiasm, use an emotionally compelling design style.
Creating a Numerical, Reliable Response
When designing data visualization for corporate groups, technicians, researchers, or corporate decision-makers, keep the design plain and elegant, using data showcasing or data storytelling visualization. To generate a rational, analytical response in your audience, include plenty of charts, graphs, scatter plots, and line diagrams.
Color choices ought to be conventional and moderate. The look and feel should yell, "Corporate chic." Visualizations like this are deployed to explain easily and plainly what's going on in the data: direct, succinct, and to the point. The finest data visualizations of this type still deliver an exquisite look and feel.
Garnering a Strong Emotional Reaction
When creating a data visualization to sway or convince people, you can incorporate design artistry that conjures up an emotional response from your intended audience. These visualizations generally meet the definition of data art, but this kind of intense emotional response can also be inspired by a highly creative piece of data storytelling. Emotionally charged data visualizations frequently reinforce the perspective of one side of a social, political, or environmental issue.
Adding Context to Your Visualization
Adding context helps your audience appreciate the importance and relative value of the information that the data visualization seeks to communicate. Adding context around the key measures in the visualization builds a sense of relative perspective. However, when producing pure data art, you should exclude context, because in data art you're simply looking to make a single point, and you don't want to add details that distract your audience from the actual message of the visualization.
In a data visualization, you can provide appropriate contextual data for the key indicators it displays. For instance, imagine you are producing a data visualization that shows conversion rates for e-commerce transactions. The main measure is the proportion of users who turn into buyers by making a purchase. Contextual data applicable to this measure can include the shopping-cart drop-off rate, the average number of visits a user makes before purchasing, the average number of pages visited before the transaction takes place, or the specific pages visited before the customer chooses to convert.
Adding contextual data can de-center the main message of the data visualization, so include this context only in visualizations designed for an analytical audience. Such viewers are in a better position to integrate the extra information and use it when drawing their own conclusions; for other kinds of viewers, context is only a distraction.
Choosing the Best Data Graphic Type For Your Visualization
Your choice of data graphic type can either make or break a data visualization. You may need to display several different aspects of your results, so you can mix and match various graphic classes and styles. Even within the same class, some graphic styles achieve better results than others; therefore, create test renderings to see which graphic style communicates the clearest and most evident meaning.
This chapter only highlights the most widely used graphic types. Do not stray too far from the well-worn path: the further you move away from common images, the more difficult it is for people to grasp the details you're trying to communicate.
Use the graphic style that most clearly shows the data patterns you're trying to display. You can present the same data pattern in several ways, but certain approaches convey the visual message more accurately than others. The goal is to provide a simple, informative graphical message to your viewers so that they can use the visualization to make meaning out of the data displayed.
Standard graphics, comparative graphics, statistical plots, topology structures, and geographic graphs and maps are among the most popular forms of data graphics. The next few sections take a look at each of these forms in detail.
Standard Chart Graphics
When creating data visualizations for non-analytical people, it is best to make
use of standard chart graphics. The much more unique and complicated the
graphics are, the tougher it is for non-analytical audiences to grasp the
message depicted fully. In contrary to what most people are made to believe
not all traditional chart forms are bland— you have quite a range to select
from, as the following list indicates:

• Area: Area charts are easy to plot, yet they offer an interesting way to
assess and compare the attributes of elements. You can use them to tell a
graphical story efficiently, particularly in data storytelling and data
showcasing.
• Bar: Bar charts are an easy way to compare the values of variables within the
same category. They are ideally suited to data storytelling and data
showcasing.
• Line: Line charts most frequently display changes in time-series data, but
they can also map the relationship between two or even three parameters. Line
charts are so flexible that they can be used in any data visualization layout.
• Pie: Pie charts, among the most widely used graphics, offer a simple way of
comparing the values of variables in the same category. Highly analytical
individuals tend to scoff at these graphics precisely because they appear so
simple, so you may want to omit them from data-showcasing visualizations.
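As a quick, hedged illustration (not taken from the text), the base R snippet
below draws a bar chart, a line chart, and a pie chart from a small made-up
sales vector; the data and names are hypothetical.

    # Hypothetical monthly sales figures (made-up data for illustration only)
    sales <- c(Jan = 120, Feb = 150, Mar = 90, Apr = 180)

    barplot(sales, main = "Monthly Sales (Bar)", ylab = "Units sold")   # bar chart
    plot(sales, type = "l", main = "Monthly Sales (Line)",              # line chart
         xlab = "Month", ylab = "Units sold")
    pie(sales, main = "Share of Sales (Pie)")                           # pie chart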
Comparative Graphics
Comparative graphics show the relative values of multiple parameters within a
shared category, or the relationships of variables across several shared
categories. The main distinction between comparative and standard graphics is
that comparative graphics let you compare more than one variable and category
at the same time, whereas standard graphics display and contrast only the
variation of a single variable within a single category.
Comparative graphics are intended for viewers who are at least somewhat
analytical, so you can comfortably use them in either data storytelling or data
showcasing.
This list briefly explains a few common forms of comparative graphics:
• Bubble plots use bubble size and color to display the relationship among
three variables in the same category.
• Packed circle diagrams use both circle size and grouping to represent the
relationships among groups, parameters, and relative variable values.
• Gantt charts are bar charts that use horizontal bars to illustrate scheduling
requirements for project-management purposes. This type of graphic is valuable
when designing a project delivery plan and when deciding the sequence in which
tasks must be performed to meet delivery dates.
• Stacked charts are used to compare multiple parameter values within the same
category. To keep the comparison easy to read, resist the temptation to include
too many variables.
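For instance (a minimal sketch, not from the text), base R's barplot() produces
a stacked bar chart when it is given a matrix; the sales data below are made
up.

    # Hypothetical sales by product (rows) and quarter (columns)
    sales <- matrix(c(10, 14,  9, 12,
                       7,  8, 11, 10),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("Product A", "Product B"),
                                    c("Q1", "Q2", "Q3", "Q4")))

    # With a matrix input, barplot() stacks the rows within each column by default
    barplot(sales, legend.text = TRUE, main = "Stacked Sales by Quarter",
            ylab = "Units sold")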
Statistical Plots
Statistical plots, which display the outcomes of statistical analysis, are
typically valuable only to a highly analytical audience (and are not suitable
for data art). Two common options are listed below:
• Histogram: A graphic that displays the frequency of a variable's values as
rectangles, a histogram lets you quickly grasp the spread and frequency of the
data in a dataset.
• Scatter Plot: A scatter plot, which positions data points by their x- and
y-values, is a brilliant way to quickly spot important patterns and outliers in
a dataset and to expose any noticeable trends. If your aim is data storytelling
or data showcasing, you can start by creating a simple scatter plot of the
areas of the dataset that look intriguing, that is, regions that may reveal
meaningful relationships or generate compelling stories.
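As a quick sketch (using R's built-in mtcars dataset rather than anything from
the text), the two base R calls below produce a histogram and a scatter plot.

    data(mtcars)

    # Histogram: distribution of fuel efficiency (miles per gallon)
    hist(mtcars$mpg, main = "Distribution of MPG", xlab = "Miles per gallon")

    # Scatter plot: car weight against fuel efficiency, to expose the trend
    plot(mtcars$wt, mtcars$mpg, main = "Weight vs. MPG",
         xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")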
Some Popular Data Visualization Tools
Some of the most widely used data visualization tools among data analysts
include:
• D3.js
• Google Charts
• Tableau
• Grafana
• Chartist.js
• FusionCharts
• Datawrapper
• Infogram
• ChartBlocks
These tools stand out from other data visualization tools because of the
variety of visualization styles they offer and their ability to visualize large
datasets.
In summary, if you want to build simple, elegant visual communications with the
right data graphics, follow these steps:
• Identify the questions your data visualization seeks to address, then review
the visualization to confirm that it actually answers those questions.
• Consider your audience and the medium when deciding where the data
visualization will be used.
• Review the data visualization one last time to confirm that its meaning is
communicated by the data graphics alone.
Chapter Eleven: Exploring Data Analytic Methods
Using R
Generally, an effective analytics project requires solid knowledge of the data,
along with a set of tools for mining and presenting it. Typical tasks include
analyzing the data in terms of simple metrics and developing graphs and plots
to visualize and recognize relationships and patterns. Many tools are available
for data discovery, retrieval, modeling, and presentation. Because of its
popularity and flexibility, open-source R is used to illustrate most of the
analytical activities and models described in this book.
This chapter sets out the basic features of the R programming language. The
first section gives an overview of how to use R for data collection, sorting,
and filtering, and how to obtain simple descriptive statistics on a dataset.
The second section discusses using R to conduct exploratory data analysis
through visualization techniques. The final section touches briefly on
inferential statistics in R, such as hypothesis testing and analysis of
variance.
Understanding the Open Source R Programming Language
R is a free, open-source statistical software framework that has been widely
used in the data analytics field over the last decade. Its versatility is one
of the main reasons I have chosen to explain this tool among the many commonly
used in data analytics. Data analysts who prefer R usually do so because of its
flexibility and its data visualization capabilities, which are hard to
replicate in Python and other common data analytics tools. Among data analytics
experts in particular, R's user base is wider than Python's.
The R programming language, or R analytics, is a free, open-source environment
that can be applied throughout a rigorous analytical process. It is designed
specifically for, and commonly used for, statistical analysis and data
collection (applicable to the first three phases of the data analytics
lifecycle). More precisely, it is used not only to analyze data but also to
develop applications that carry out statistical analysis effectively.
In addition to the standard statistical methods, R also provides extensive
graphical capabilities. As such, it can be used for a wide variety of
computational modeling tasks, such as traditional statistical studies, linear
and non-linear modeling, time-series analysis, and many more. The following are
some key R concepts you should familiarize yourself with before going deeper
into the world of data analytics.
R’s Common Vocabulary
Although R's vocabulary can initially appear complex, sophisticated, and even
'weird' to new analysts, you can master it through steady study and repetition.
For example, you can run R in one of two modes:
• Non-Interactive: You run your R code as a .r file (the file extension
commonly given to script files written for execution by R) directly from the
command line.
• Interactive: You work in a software application that interacts with you,
prompting you to enter your data and R code. In an interactive R session, you
can load datasets or access the raw data directly, assign names to variables
and data objects, and use functions, operators, and built-in iterators to gain
clarity into your source data.
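To make this concrete, here is a minimal sketch: a tiny script you might save
as analysis.r (the file name is hypothetical), with the shell command that runs
it non-interactively shown as a comment, followed by a note on running the same
lines interactively.

    # Non-interactive: save these lines as analysis.r and run them from the
    # command line with:  Rscript analysis.r
    x <- c(4, 8, 15, 16, 23, 42)   # assign a name to a small numeric vector
    summary(x)                     # print simple descriptive statistics

    # Interactive: typing the same two lines at the R (or RStudio) console
    # executes them immediately and echoes the results back to you.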
R is an object-oriented language, which means that 'classes' underpin the
different parts of the language. Each class has its own meaning and function;
an object of a class is generally described as an instance of that class and
therefore inherits that class's characteristics. Classes are also polymorphic:
a subclass can have its own specific behaviors or attributes while sharing some
of the functionality of its parent class. R's print() function demonstrates
this idea. Because the function is polymorphic, its behavior depends entirely
on the class of the object it is asked to print, so it operates differently for
different objects.
In other words, many classes respond to this function (and several others)
through the same generic call, but the results vary somewhat by class.
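A minimal sketch of this polymorphism: the same generic print() call renders a
numeric vector, a factor, and a data frame quite differently.

    x <- c(2, 5.3, -22)                    # numeric vector
    f <- factor(c("low", "high", "low"))   # factor
    df <- data.frame(id = 1:3, score = x)  # data frame

    print(x)    # plain vector output
    print(f)    # shows the values plus a "Levels:" line
    print(df)   # shows labeled rows and columns, like a small table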
R Studio
The R program itself uses a command-line interface (CLI), quite similar to a
BASH shell on Linux or the interactive shells of scripting languages like
Python. For Microsoft Windows users, R ships with RGui.exe, which offers a
basic graphical interface. However, to make writing, running, and debugging R
code more efficient, newer graphical interfaces have been developed for R; the
most popular and commonly used of these is RStudio.
RStudio is an integrated development environment (IDE) that helps users
communicate more easily with R. It is similar to the generic RGui but much more
user-friendly, with multiple drop-down menus, several tabbed windows, and many
configuration options. You will notice three windows the very first time you
open RStudio. The fourth window is closed by default but can be opened by
creating a new R script from the File menu. The following briefly outlines the
RStudio windows:
• Workspace: Just as the name implies, the workspace is your current working
environment, where most R work is carried out; it lists all variables and
datasets in the R environment.
• Scripts: This is where R code is written and saved.
• Plots: This is where the plots generated by your R code are displayed. It
also provides a straightforward way to export plots for later use.
• Console: The console shows the R code that has been executed and its results.
It is also used for obtaining help and other useful information about R.
Understanding R Data Types
Just like other programming languages, open-source R employs various data types
in carrying out its analysis. In fact, R recognizes more than twelve data
types, most of which relate to functions and other specialized objects.
Numeric, logical, and character variables are the data types most commonly used
by data analysts in their analytical work. These data types are explained
below:
• Numeric: This data type is used for numerical values and is the default type
for numbers in R. Examples include 2, 5.3, -22, and -47.023. Numeric values can
be used for all data attributes, but they work best for ratio-scale
measurements or attributes.
• Logical: The logical data type stores the logical values TRUE and FALSE. It
is generally used for nominal data attributes.
• Character: The character data type stores character values, or strings. An R
string can include letters, digits, and symbols. The way to indicate that an R
value is of character type is to place the value within single or double
quotation marks. Character data can be used as nominal or ordinal data.
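A minimal sketch of these three types, using class() to confirm how R stores
each value:

    n <- -47.023       # numeric
    b <- TRUE          # logical
    s <- "store 12"    # character (quotation marks make it a string)

    class(n)   # "numeric"
    class(b)   # "logical"
    class(s)   # "character"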
Exploring Various R Variables/Objects
The data types listed above can be held in a variety of variable types, or
objects. In Excel terms, the equivalent of a variable would be a row, a column,
or a data table. The important variables (also known as objects) commonly used
in open-source R include:
• Vector: A vector comprises one or more elements and behaves like a column or
a row of data. Vectors can hold any of the data types described above, but each
vector is stored as a single type. The vector c(1, 3, 2, 8, 5) is, by default,
a double-precision numeric vector, whereas c(1, 3, 2, 8, 5, "name") is a
character vector in which all the data are kept as the character type; in the
latter case, the numbers are treated as characters (that is, as a nominal data
attribute) rather than as numbers.
• Factor: A factor is a special character-like form of a vector in which the
text strings represent the levels of the factor and each occurrence is stored
internally as an integer. Factors can be treated as nominal data when the
ordering of the values does not matter, or as ordinal data when it does.
• Array: An array is a generalization of a vector from one dimension to
several. The dimensions of an array must be specified in advance, and an array
can have any number of dimensions. As with vectors, all array elements must be
of the same data type.
• Matrix: You can think of a matrix as a collection of vectors. A matrix can be
of any mode (character, numeric, or logical), but all of its elements must be
of the same mode. Put simply, a matrix is a special type of array containing
numeric or character elements; it is strictly two-dimensional, with rows and
columns, and all columns must have the same data type and the same number of
rows. R includes a variety of features for manipulating matrices, such as
transposition, matrix multiplication, and computing eigenvalues and
determinants.
• List: Lists are vectors whose elements are other R objects; each object in a
list can be of a different data type and of a different length or size from the
other elements. Lists can also hold any other form of data, including other
lists.
• Data Frame: Data frames (or the tidyverse extension of the data frame, the
tibble) are probably the most commonly used variable type for working analysts.
A data frame is the list equivalent of the matrix: an m × n list in which all
columns are vectors with the same number of rows. Unlike matrices, the columns
may contain different data types, and rows and columns can be labeled. If not
explicitly labeled, R automatically labels the rows with their row numbers and
the columns based on the data assigned to each column.
Data frames are typically used to store the kinds of tabular data widely used
by engineers and scientists, and they are the closest match in R to an Excel
spreadsheet. Generally, a data frame consists of several variables and one or
more columns of numeric data.
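A minimal sketch showing one of each object type described above (all names and
values are made up for illustration):

    v <- c(1, 3, 2, 8, 5)                          # numeric vector
    f <- factor(c("small", "large", "small"))      # factor with two levels
    a <- array(1:24, dim = c(2, 3, 4))             # 2 x 3 x 4 array
    m <- matrix(1:6, nrow = 2, ncol = 3)           # 2 x 3 numeric matrix
    l <- list(scores = v, sizes = f, note = "ok")  # list mixing different types
    df <- data.frame(id = 1:3, size = f)           # data frame with labeled columns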
In R, there are two ways to access the elements of vectors, matrices, and
lists:
• Single brackets [ ], which can return one or more elements.
• Double brackets [[ ]], which return exactly one element.
R users often debate the correct use of the two indexing styles. The double
bracket has some advantages over the single bracket; for instance, when you
supply an index that is out of bounds, the double bracket raises an error.
However, if you want to select more than a single element of a vector, matrix,
or list, you must use single brackets.
In open-source R, all variables are objects, and R distinguishes between
objects by their internal storage type and their class designation, which can
be inspected with the typeof() and class() functions. R functions are also
objects, and users can define new objects to capture the output of functions.
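A short sketch of the difference, using a small hypothetical list:

    prices <- list(apple = 1.2, pear = 0.9, plum = 2.5)

    prices[1:2]        # single brackets: returns a sub-list of two elements
    prices[["apple"]]  # double brackets: returns the single value 1.2
    typeof(prices)     # "list"  (internal storage type)
    class(prices)      # "list"  (class designation)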
Now that you have a reasonable grasp of R's most commonly used vocabulary, I
bet you are eager to see how it works in some actual analytics. The following
sections discuss some common analytical procedures that can easily be performed
with R.
Taking a Quick Peep Into Functions and Operations
When writing your own functions, you can choose one of two methods: a quick,
simple method, and a method that is more involved but ultimately more valuable.
Either approach achieves the same goal, but each has its own benefits. If you
want to call a function to produce a quick, simple result and you do not expect
to need the function again, use Method 1. On the other hand, if you want to
write a function that you can apply to various purposes and reuse with
different datasets in the future, Method 2 is your best bet.
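The text does not show code for the two methods at this point, so the sketch
below is only my own interpretation: Method 1 as a throwaway function defined
and called in a single step, and Method 2 as a named, reusable function (the
names and data are hypothetical).

    # Method 1 (quick and simple): define and use the function in one step,
    # without keeping it for later
    (function(x) mean(x) / sd(x))(c(4, 8, 15, 16, 23, 42))

    # Method 2 (reusable): give the function a name so it can be applied
    # to other datasets later
    signal_ratio <- function(x) {
      mean(x) / sd(x)
    }
    signal_ratio(c(4, 8, 15, 16, 23, 42))
    signal_ratio(mtcars$mpg)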
The biggest benefit of using functions in R is that they make the code you
write for data analysis simpler, more succinct, easier to read, and more
reusable. At the most basic level, though, R even implements its operators as
functions.
Applying an operator and calling a function work in the same way under the
hood; only their syntax differs. R uses many of the same operators found in
other programming languages.
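For example, the + operator is itself a function and can be called with
function syntax, which makes it easy to see that operators and function calls
work the same way underneath:

    2 + 3        # ordinary operator syntax
    `+`(2, 3)    # the same addition, written as a function call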
Understanding Common Statistical and Analytical Packages In R
R has a variety of easy-to-install packages and functions, many of which are
very valuable in data analytics. In R, packages are bundles of functions, data,
and code suited to conducting particular types of analysis or sets of analyses.
The CRAN website lists the latest downloadable packages at
https://cran.r-project.org/web/packages, along with guidance on how to access,
install, and load them. I address some common packages in this section and then
dig further into the functionality of a few of the more sophisticated ones.
These packages help with tasks such as forecasting, multivariate analysis, and
factor analysis. This section gives a brief rundown of some of the most common
packages that are valuable for analytical work.
R's forecasting packages include numerous forecasting functions that can be
tailored to ARIMA (Autoregressive Integrated Moving Average) models or other
forms of univariate time-series forecasting. Or perhaps you want to use R for
quality control: for quality and statistical process control, you can use the
Quality Control Charts (qcc) package.
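As a brief illustration (assuming the widely used forecast package, which the
text does not name explicitly), an ARIMA model can be fitted to R's built-in
AirPassengers series and projected forward:

    library(forecast)                  # install.packages("forecast") if needed

    fit <- auto.arima(AirPassengers)   # choose an ARIMA model automatically
    forecast(fit, h = 12)              # forecast the next 12 months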
In data analytics practice, you would likely benefit from almost any package
that focuses on multivariate regression. If you want logistic regression, you
could use R's multinomial logit package (mlogit), in which observations with
known classes are used to 'train' the model to classify other observations
whose classes are not known.
You can use factor analysis if you want R to take a large set of related
variables and identify which underlying factors matter for a particular
purpose. To illustrate the basic principle, suppose you own a cafe and you want
to do everything you can to keep customer satisfaction as high as possible.
Factor analysis can help you determine which underlying factors have the
greatest effect on customer satisfaction ratings; in essence, many individual
survey items may consolidate into broader factors such as decor, the layout of
the restaurant, and the friendliness, behavior, and expertise of the staff.
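A minimal sketch with base R's factanal() function and simulated survey scores
(the data and column names are entirely made up):

    set.seed(42)
    service <- rnorm(200)   # hidden "staff" factor
    setting <- rnorm(200)   # hidden "decor/layout" factor

    # Six observed survey items driven by the two hidden factors plus noise
    ratings <- data.frame(
      friendliness = service + rnorm(200, sd = 0.4),
      expertise    = service + rnorm(200, sd = 0.4),
      speed        = service + rnorm(200, sd = 0.4),
      decor        = setting + rnorm(200, sd = 0.4),
      layout       = setting + rnorm(200, sd = 0.4),
      comfort      = setting + rnorm(200, sd = 0.4)
    )

    factanal(ratings, factors = 2)   # the loadings show which items group together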
Few analysts enter their data into R by hand. More often, data is imported from
Microsoft Excel or from a relational database. Driver packages are available
for importing data from different types of relational databases, including
RSQLite, RPostgreSQL, RMySQL, and RODBC, along with packages for several other
RDBMSs. One of R's strengths is its ability to build publication-quality
graphics and data visualizations that help you gain real insight into your
data. The ggplot2 package provides a wide variety of ways to display data.
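As a hedged example of the database route (using the DBI interface with the
RSQLite driver and a throwaway in-memory database, so nothing here depends on
the text's own examples):

    library(DBI)       # generic database interface
    library(RSQLite)   # SQLite driver

    con <- dbConnect(RSQLite::SQLite(), ":memory:")   # in-memory database
    dbWriteTable(con, "cars", mtcars)                 # load a built-in dataset as a table
    head(dbGetQuery(con, "SELECT mpg, cyl, wt FROM cars"))
    dbDisconnect(con)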
Exploring Various Packages for Visualization, Graphing, and Mapping in
R
If you are looking for a simple and effective way to generate good-looking data
visualizations that extract and convey the insights in your datasets, look no
further than R's ggplot2 package. It was developed to help analysts construct
all sorts of R data graphics, including histograms, scatter plots, bar charts,
box plots, and density plots. It also provides a wide range of customization
options, such as color, style, transparency, and line-thickness settings.
ggplot2 is helpful if you really want to showcase data; if the goal is data
storytelling or data art, however, it is far from the best choice.
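A minimal ggplot2 sketch, again using the built-in mtcars dataset, that draws a
scatter plot with color mapped to a grouping variable:

    library(ggplot2)   # install.packages("ggplot2") if needed

    ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
      geom_point(size = 3) +
      labs(title = "Weight vs. Fuel Efficiency",
           x = "Weight (1000 lbs)", y = "Miles per gallon",
           colour = "Cylinders")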
Conclusion
Data analytics is the buzzword of the 21st century. It uncovers vast numbers of
hidden trends, associations, and other insights that are essential to the
subject matter.
In every industry, data analytics and business-centric analytics are in high
demand, offering substantial rewards and the opportunity to produce insights
that can drive growth in any field. That is exactly why you need to get your
hands 'dirty' with the proven approaches and techniques employed in data
analytics.
Broadly speaking, data analytics deals with almost anything data-related,
depending on the subject matter or area of interest.
The data analytics process and lifecycle involve data discovery, data
preparation, model planning, model building, communication of outcomes, and
operationalization. This lifecycle cannot be properly understood, however,
without exploring the important components that make it up.
The best part of it all: I have covered these essential components and
procedures in this book.
The first and second chapters of this book explore the two most important
components of data analytics: data and big data. Understanding these components
will fast-track your grasp of the realms of data analytics.
I also explained the most commonly confused concepts surrounding data analytics
in chapters three, four, and five, and how they relate to our subject matter,
data analytics.
Essentially, chapters one through five explore the fundamentals and walk you
through everything you need to know about data analytics, while chapters six
through eleven explore the various techniques and procedures employed in data
analytics and how you can use them to gain data insights in your own subject
area.
Note: Just because you've completed this book doesn't mean that there's
nothing new to discover about the subject. Learning is continuous, and
practice makes perfect!