
Chapter 4: Predictive Analytics I: Data Mining Process, Methods, and Algorithms

Predictive analytics forecasts future events by analyzing past data, with data mining serving as a key technology for extracting useful patterns from large datasets. The data mining process involves several steps, including understanding business needs, preparing data, building models, and evaluating results, and can be applied across various industries such as banking, healthcare, and retail. The CRISP-DM framework standardizes the data mining process, while supervised and unsupervised learning techniques are used to classify and discover patterns in data.


PREDICTIVE ANALYTICS aims to determine what is likely to happen in the future. It forecasts
future events by analysing past data.

[Figure: predictive analytics encompasses data mining, text analytics, web analytics, and predictive modelling.]

LO 1 Define data mining as an enabling technology for business analytics.


DATA MINING is a term used to describe discovering or ‘mining’ knowledge from large amounts
of data. It is a process that uses statistical, mathematical, and artificial intelligence
techniques to extract and identify useful information and subsequent knowledge or patterns
from large databases.

Data mining is defined as the nontrivial process of identifying valid, novel, potentially
useful, and ultimately understandable patterns in data stored in structured databases:
►process – implies that data mining comprises many iterative steps
►nontrivial – means that some experimentation-type search or inference is involved
►valid – means that the discovered patterns should hold true on new data with a
sufficient degree of certainty
►novel – means that the patterns are not previously known to the user within the context
of the system being analysed
►potentially useful – means that the discovered patterns should lead to some benefit to
the user or task
►ultimately understandable – means that the pattern should make business sense that
leads to the user saying “it makes sense”
Data mining is a blend of multiple disciplines, including: statistics, artificial intelligence,
machine learning and pattern recognition, information visualisation, database
management and data warehousing, and management science and information
systems.
FACTORS THAT HAVE INCREASED THE POPULARITY OF DATA MINING:
►more intense competition on a global scale, driven by customers’ ever-changing needs
and wants in an increasingly saturated marketplace
►general recognition of the untapped value hidden in large data sources
►consolidation and integration of database records which enables a single view of
customers, vendors, transactions, etc
►consolidation of databases and other data repositories into a single location in the form
of a data warehouse
►the exponential increase in data processing and storage technologies
►significant reduction in the cost of hardware and software for data storage and
processing
►movement toward the demassification of business practices – the conversion of
information resources into nonphysical form
A company that effectively uses data mining tools and techniques can acquire and
maintain a strategic competitive advantage. Data mining offers organisations an
indispensable decision-enhancing environment to exploit new opportunities by
transforming data into a strategic weapon.
HOW DOES DATA MINING WORK?
Using existing and relevant data obtained from within and outside the organisation, data
mining builds models to discover patterns among the attributes present in the data set.
Models are the mathematical representations (linear or nonlinear relationships) that
identify the patterns among the attributes of the things described within the dataset.
Data mining seeks to identify the following major types of patterns:
►associations – this finds the commonly co-occurring groupings of things e.g. beer and
diapers (the claim that men who go to a store to buy diapers are also likely to buy beer)
►predictions – this tells the nature of future occurrences of certain events based on what
has happened in the past e.g. forecasting the max temperature of a particular day
►clusters – this identifies natural groupings of things based on their known characteristics
e.g. assigning customers in different segments based on their demographics and past
purchase behaviours
►sequential relationships – this discovers time-ordered events e.g. predicting that an
existing banking customer who already has a checking account will open a savings
account followed by an investment account within a year

LO 2 Understand the objectives and benefits of data mining.


CHARACTERISTICS AND OBJECTIVES OF DATA MINING:
►data are often buried deep within very large databases, which sometimes contain
data from several years and the data may be presented in a variety of formats
►the data mining environment is usually a client/server architecture or a web-based
information system architecture
►sophisticated new tools help to extract the information ‘ore’ buried in corporate files or
archival public records – finding it involves massaging and synchronising the data to get
the right results
►the miner is often an end user, empowered by data drills and other powerful query tools
to ask ad hoc questions and obtain answers quickly with little or no programming skill



►striking it rich often involves finding an unexpected result and requires end users to think
creatively throughout the process and when interpreting the findings
►data mining tools are readily combined with spreadsheets and other software
development tools so that the mined data can be analysed and developed quickly and
easily
►due to the large amounts of data and massive search efforts, it is sometimes necessary
to use parallel processing for data mining
LO 3 Become familiar with the wide range of applications of data mining.

DATA MINING APPLICATIONS:


►customer relationship management (CRM) – Data mining can be used to: (1) identify
most likely buyers of new products and services, (2) understand the root causes of
customer attrition to improve customer retention, (3) discover time-variant associations
between products and services to maximise sales and customer value, and (4) identify
the most profitable customers and their preferential needs to strengthen relationships and
to maximise sales.
►banking – Data mining can help banks with the following: (1) automating the loan
application process by accurately predicting the most probable defaulters, (2) detecting
fraudulent credit card and online banking transactions, (3) identifying ways to maximise
customer value by selling them products and services that they are most likely to buy, and
(4) optimising the cash return by accurately forecasting the cash flow on banking entities.
►retailing and logistics – Data mining can be used to: (1) predict accurate sales volumes
at specific retail locations to determine correct inventory levels, (2) identify sales
relationships between different products to improve the store layout and optimise sales
promotions, (3) forecast consumption levels of different product types based on seasonal
and environmental conditions to optimise logistics and maximise sales, and (4) discover
interesting patterns in the movement of products in a supply chain by analysing sensory
and radio-frequency identification data.
►manufacturing and production – Manufacturers can use data mining to: (1) predict
machinery failures before they occur through the use of sensory data, (2) identify
anomalies and commonalities in production systems to optimise manufacturing capacity,
and (3) discover novel patterns to identify and improve product quality.
►brokerage and securities trading – Brokers and traders use data mining to: (1) predict
when and how much certain bond prices will change, (2) forecast the range and
direction of stock fluctuations, (3) assess the effect of particular issues and events on
overall market movements, and (4) identify and prevent fraudulent activities in securities
trading.
►insurance – The insurance industry uses data mining techniques to: (1) forecast claim
amounts for property and medical coverage costs for better business planning, (2)
determine optimal rate plans based on the analysis of claims and customer data, (3)
predict which customers are more likely to buy new policies with special features, (4)
identify and prevent incorrect claim payments and fraudulent activities.
►computer hardware and software – Data mining can be used to: (1) predict disk failures
well before they actually occur, (2) identify and filter unwanted web content and email
messages, (3) detect and prevent computer network security breaches, and (4) identify
potentially unsecure software products.



►government and defence – Data mining can be used to: (1) forecast the cost of
moving military personnel and equipment, (2) predict an adversary’s moves and develop
more successful strategies for military engagements, (3) predict resource consumption for
better planning and budgeting, and (4) identify classes of unique experiences, strategies,
and lessons learned from military operations for better knowledge sharing throughout the
organisation.
►travel industry – Data mining is used to: (1) predict sales of different services in order to
optimally price services to maximise revenues as a function of time-varying transactions,
(2) forecast demand at different locations to better allocate limited organisational
resources, (3) identify the most profitable customers and provide them with personalised
services to maintain their repeat business, and (4) retain valuable employees by
identifying and acting on the root causes for attrition.
►healthcare – Data mining can be used to: (1) identify people without health insurance
and the factors underlying this undesired phenomenon, (2) identify novel cost-benefit
relationships between different treatments to develop more effective strategies, (3)
forecast the level and the time of demand at different service locations to optimally
allocate organisational resources, and (4) understand the underlying reasons for customer
and employee attrition.
►medicine – Data mining can be used to: (1) identify novel patterns to improve
survivability of patients with cancer, (2) predict success rates of organ transplantation
patients to develop better organ donor matching policies, (3) identify the functions of
different genes in the human chromosome, and (4) discover the relationships between
symptoms and illnesses to help medical professionals make informed and correct
decisions in a timely manner.
►entertainment industry – Data mining is used to: (1) analyse viewer data to decide what
programs to show during prime time and how to maximise returns by knowing where to
insert advertisements, (2) predict the financial success of movies before they are
produced to make investment decisions and to optimise the returns, (3) forecast the
demand at different locations and different times to better schedule entertainment
events and to optimally allocate resources, and (4) develop optimal pricing policies to
maximise revenues.
►homeland security and law enforcement – Data mining is used to: (1) identify patterns
of terrorist behaviours, (2) discover crime patterns to help solve criminal cases in a timely
manner, (3) predict and eliminate potential biological and chemical attacks to the
nation’s critical infrastructure by analysing special-purpose sensory data, and (4) identify
and stop malicious attacks on critical information infrastructures.

LO 4 Learn the standardised data mining processes.


CRISP-DM (Cross-Industry Standard Process for Data Mining) is one of the most widely used
methodologies for building data mining projects. It is a cross-industry standard process
that provides a structured approach to planning and executing a data mining project.



STEPS IN THE CRISP-DM DATA MINING PROCESS:
(1) Business understanding – What does the business need?
This step focuses on understanding the objectives and requirements of the project. The
first step is all about understanding what you want to accomplish, from a business
perspective.
(2) Data understanding – What data do we have/ need? Is it clean?
This step involves identifying the relevant data from many available databases. The data
understanding step starts with an initial data collection and proceeds with activities to get
familiar with the data, identify data quality problems, discover first insights into the data,
or detect interesting subsets to form hypotheses for hidden information.
(3) Data preparation – How do we organise the data for modelling?
This step covers all activities to construct the final data set from the initial raw data and focuses
on taking the data identified in the previous step and preparing it for analysis by data
mining methods. This step accounts for roughly 80% of the total time spent on a data
mining project.
(4) Model building – What modelling techniques should we apply?
This step is where various modelling techniques are selected and applied to an already
prepared data set to address the specific business need. This phase also encompasses
the assessment and comparative analysis of the various models built.
(5) Testing and evaluation – Which model best meets the business objectives?
In this step, the developed models are assessed and evaluated for their accuracy and
generality. This step assesses the degree to which the selected model (or models) meets the
business objectives.
(6) Deployment – How do stakeholders access the results?
This step is where data mining pays off. The knowledge gained will need to be organized
and presented in a way that the customer can use it. However, depending on the
requirements, the deployment phase can be as simple as generating a report or as
complex as implementing a repeatable data mining process across the enterprise.
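
The flow of these six steps can be illustrated with a minimal, hypothetical Python sketch. The file name, column names, and choice of a decision-tree model below are assumptions made for illustration only, not part of CRISP-DM itself:

```python
# A minimal sketch of a CRISP-DM-style workflow using pandas and scikit-learn.
# The data source, column names, and model choice are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# (1) Business understanding: e.g. "predict which customers are likely to churn"
target_column = "churned"               # hypothetical label column

# (2) Data understanding: collect the data and get familiar with its quality
df = pd.read_csv("customers.csv")       # hypothetical data source
print(df.describe())
print(df.isna().sum())

# (3) Data preparation: clean and encode the raw data for modelling
df = df.dropna()
X = pd.get_dummies(df.drop(columns=[target_column]))
y = df[target_column]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# (4) Model building: select and apply a modelling technique
model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)

# (5) Testing and evaluation: assess accuracy/generality on held-out data
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# (6) Deployment: make the result usable, e.g. persist the model for a report or application
# import joblib; joblib.dump(model, "churn_model.joblib")
```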

LO 5 Learn different methods and algorithms of data mining.

Data mining tasks can be classified into three main categories: prediction, association,
and clustering. Based on the way in which the patterns are extracted from the historical
data, the learning algorithms of data mining methods can be classified as either
supervised or unsupervised.

We must first determine what we want to do: are we trying to predict something or
describe something?

Predictive data mining tasks come up with a model from the available dataset that is helpful
in predicting unknown or future values of another dataset of interest. Descriptive data
mining tasks find patterns that describe the data and come up with new, significant
information from the available dataset.



SUPERVISED LEARNING is the data mining task of inferring a function from labelled training
data. Supervised learning models try to predict the output (dependent variable/ supervisory
signal) from the given input (independent variables). The goal of supervised learning is to
train the model so that it can predict the output when it is given new data. Supervised
learning needs supervision to train the model, much as a student learns in the presence of a
teacher. Supervised learning can be used for two types of problems: classification and
regression (when dealing with predictive data mining tasks we make use of supervised
learning).

UNSUPERVISED LEARNING is another machine learning method in which models are trained using
unlabelled data. Unsupervised learning models do not have a dependent variable; they find
hidden patterns in the input (independent variable) data. The goal of unsupervised learning
is to find the hidden patterns and useful insights in an unknown dataset. Unsupervised
learning does not need any supervision to train the model; instead, it finds patterns in the
data on its own. Unsupervised learning can be used for two types of problems: clustering and
association (when dealing with descriptive data mining tasks we make use of unsupervised
learning).
WHAT IS THE DIFFERENCE BETWEEN SUPERVISED AND UNSUPERVISED?
Supervised tasks require different techniques than those used for unsupervised tasks and
the results are often much more useful. In supervised learning, the algorithm “learns” from
the training dataset by iteratively making predictions on the data and adjusting for the
correct answer. While supervised learning models tend to be more accurate than
unsupervised learning models, they require upfront human intervention to label the data
appropriately. Unsupervised learning models, in contrast, work on their own to discover
the inherent structure of unlabelled data. Note that they still require some human
intervention for validating output variables. To put it simply, supervised learning uses
labelled input and output data, while an unsupervised learning algorithm does not.
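
As a minimal illustration of this difference, the sketch below fits a supervised model with both the inputs and the labels, and an unsupervised model with the inputs only. The use of scikit-learn's bundled Iris data and these particular algorithms is an assumption made purely for demonstration:

```python
# Supervised vs unsupervised learning on the same inputs (illustrative sketch).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the labels y act as the "teacher" (supervisory signal)
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)                        # learns from labelled data
print(clf.predict(X[:5]))            # predicts predefined class labels

# Unsupervised: no labels are provided; structure is discovered from X alone
km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(X)                            # finds clusters without using y
print(km.labels_[:5])                # discovered cluster assignments, not predefined classes
```
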
CLASSIFICATION/ SUPERVISED INDUCTION is the process of arranging the collected data into
classes and subclasses according to their common characteristics. It is the grouping of
related facts into classes. The goal of classification is to analyse the historical data
stored in a database and automatically generate a model that can predict future behaviour.
Classification is a type of machine learning (a method of data analysis that automates
analytical model building) and is the most frequently used data mining method for real-world
problems.
Example: one could use classification to predict whether the weather on a particular day
will be sunny, rainy, or cloudy (classification uses class labels) or a classification model
could be used to identify loan applicants as low, medium, or high credit risks
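
A classification model for the credit-risk example above might be sketched as follows. The features and the tiny hand-made training set are hypothetical, and a decision tree is only one of several classifiers that could be used:

```python
# Classification sketch: labelling loan applicants as low/medium/high credit risks.
# The features and training records are made up for illustration.
from sklearn.tree import DecisionTreeClassifier

# Each row: [annual income (thousands), existing debt (thousands), years employed]
X_train = [[80, 5, 10], [45, 30, 2], [30, 40, 1], [95, 10, 15], [50, 25, 4], [25, 45, 1]]
y_train = ["low", "medium", "high", "low", "medium", "high"]   # class labels

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

new_applicant = [[60, 20, 5]]
print(model.predict(new_applicant))   # outputs a class label such as 'medium', not a number
```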

REGRESSION is a way of mathematically sorting out which variables have an impact on an
outcome and which do not. It is the process of finding a model or function that maps the
data to continuous real values instead of classes. It is a data mining technique used to
predict a range of numeric values given a particular dataset. It analyses the relationship
between a target (dependent) variable and its predictor (independent) variables.
Example: one could use regression to predict what the temperature will be on a particular
day, such as 28° (regression uses numeric values)
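
A minimal regression sketch for the temperature example, where the past observations are invented for illustration:

```python
# Regression sketch: predicting a numeric value (temperature) rather than a class.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical history: day of year vs maximum temperature (°C)
day_of_year = np.array([[10], [40], [70], [100], [130], [160]])
max_temp = np.array([31.0, 29.5, 26.0, 22.5, 19.0, 16.5])

reg = LinearRegression()
reg.fit(day_of_year, max_temp)       # relate the target to its predictor

print(reg.predict([[85]]))           # predicted temperature for day 85 (a numeric value)
print(reg.coef_, reg.intercept_)     # how the predictor affects the target
```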



WHAT IS THE DIFFERENCE BETWEEN REGRESSION AND CORRELATION?
Regression establishes a functional relationship between variables in order to make future
projections of events. Correlation determines whether variables are correlated and
determines the strength of their association. Correlation measures the degree of a
relationship between two variables (x and y) whereas regression is how one variable
affects another. Correlation quantifies the strength of the linear relationship between a
pair of variables, whereas regression expresses the relationship in the form of an equation.
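
This distinction can be made concrete with a short NumPy sketch (the x and y values are arbitrary sample data): correlation yields a single strength-of-association number, while regression yields an equation that can be used for projection.

```python
# Correlation vs regression on the same pair of variables (illustrative data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # e.g. advertising spend
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])   # e.g. sales

# Correlation: a single number in [-1, 1] describing the strength of linear association
r = np.corrcoef(x, y)[0, 1]
print("correlation r =", r)

# Regression: an equation y ≈ slope*x + intercept that can be used for future projections
slope, intercept = np.polyfit(x, y, deg=1)
print("regression: y =", slope, "* x +", intercept)
print("projected y at x = 6:", slope * 6 + intercept)
```
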
ASSOCIATIONS/ ASSOCIATION RULE MINING/ MARKET BASKET ANALYSIS is a category of
data mining that establishes relationships about items that occur together in a given
record. It aims to find interesting relationships between variables in large databases.
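
A small market-basket sketch, assuming the third-party mlxtend library is installed; the transactions are made up for illustration:

```python
# Association rule mining sketch using the (assumed) mlxtend library.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical transactions: items bought together in each basket
transactions = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["milk", "bread"],
    ["beer", "diapers", "bread"],
    ["milk", "chips"],
]

# One-hot encode the baskets, then mine frequent itemsets and rules
te = TransactionEncoder()
basket = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
frequent = apriori(basket, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```
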
CLUSTERING is an unsupervised machine learning-based algorithm that discovers groups
and structures in the data that are in some way or another "similar", without using known
structures in the data. It partitions a collection of things into segments whose members
share similar characteristics.
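
A clustering sketch for the customer-segmentation idea mentioned earlier; the customer attributes are invented and k-means is just one of several possible clustering algorithms:

```python
# Clustering sketch: segmenting customers by age and annual spend (illustrative data).
import numpy as np
from sklearn.cluster import KMeans

# Each row: [age, annual spend]; no class labels are provided (unsupervised)
customers = np.array([
    [22, 300], [25, 350], [23, 280],       # younger, lower spend
    [45, 1500], [50, 1600], [48, 1450],    # middle-aged, higher spend
    [70, 600], [68, 650],                  # older, moderate spend
])

km = KMeans(n_clusters=3, n_init=10, random_state=0)
segments = km.fit_predict(customers)
print(segments)              # natural groupings discovered from the data itself
print(km.cluster_centers_)   # typical profile of each discovered segment
```
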
TAXONOMY FOR DATA MINING TASKS, METHODS, AND ALGORITHMS
[Figure not reproduced: a taxonomy organising data mining tasks into prediction (classification, regression), association, and clustering, together with their corresponding supervised and unsupervised learning methods and algorithms.]



LO 6 Understand the privacy issues, pitfalls, and myths of data mining.
PRIVACY ISSUES IN DATA MINING: Data that is collected, stored, and analysed in data
mining often contains information about real people. Such information may include
identification data, demographic data, financial data, purchase history, and other
personal data. Most of these data can be accessed through some third-party data
providers. There have been a number of instances in the past where companies have
shared their customer data with others without seeking their consent.

DATA MINING MISTAKES/ PITFALLS:


►selecting the wrong problem for data mining because not every business problem can
be solved with data mining – when there is no representative data, there cannot be a
data mining project
►ignoring what your sponsor thinks data mining is and what it really can and cannot do –
expectation management is key for successful data mining projects
►beginning without the end in mind
►defining the project around a foundation that your data cannot support – knowing what
the limitations of data are will help you craft feasible projects that deliver results and meet
expectations
►leaving insufficient time for data preparation – avoid proceeding into modelling until
after your data is properly processed
►looking only at aggregated results and not at individual records – avoid unnecessarily
aggregating and overly simplifying data to help data mining algorithms
►being sloppy about keeping track of the data mining procedure and results – success
requires a systematic and orderly planning, execution, and tracking/ recording of all data
mining tasks
►using data from the future to predict the future
►ignoring suspicious findings and quickly moving on – proper investigation of suspicious
findings can lead to pleasing discoveries
►starting with a high-profile complex project that will make you a superstar – the goal
should be to show incremental and continuous value added, as opposed to taking on a
large project that will consume resources without producing any valuable outcomes
►running data mining algorithms repeatedly and blindly – one should know how to
transform the data and set the proper parameter values to obtain the best possible results
►ignoring the subject matter experts – understanding the problem domain and the related
data requires a highly involved collaboration between the data mining and the domain
experts
►believing everything you are told about the data – validation and verification through a
critical analysis is key to understanding and processing of the data
►assuming that the keepers of the data will be fully on board with cooperation –
understanding and managing the politics is a key to identify, access, and properly
understand the data to produce a successful data mining project
►measuring your results differently from the way your sponsor measures them – producing
the results in a measure and format that appeals to the end user increases the likelihood
of true understanding and proper use of the data mining outcomes



►if you build it, they will come: don’t worry about how to serve it up – deployment is a
necessary last step in the data mining process where models are integrated into the
organisational decision support infrastructure for enablement of better and faster decision
making

DATA MINING MYTHS:


►data mining provides instant, crystal-ball-like predictions
Reality: data mining is a multistep process that requires deliberate, proactive design and
use
►data mining is not yet viable for mainstream business applications
Reality: the current state of the art is ready to go for almost any business type and/ or size
►data mining requires a separate, dedicated database
Reality: because of the advance in database technology, a dedicated database is not
required
►only those with advanced degrees can do data mining
Reality: newer web-based tools enable managers of all educational levels to do data
mining
►data mining is only for large firms that have lots of customer data
Reality: if the data accurately reflects the business or its customers, then any company
can use data mining

REFERENCES – the above summary is made using the following textbook:


Sharda, R., et al. 2018. Business Intelligence, Analytics, and Data Science: A Managerial
Perspective. Fourth Edition. Pearson.
PLEASE NOTE: I am selling the service provided in summarising this chapter and not the
intellectual property provided.

@study_ingmadesimple Luca du Toit
