Chapter 4 Predictive Analytics I: Data Mining Process, Methods, and Algorithms
Predictive analytics aims to forecast future events by analysing patterns in past data.
[Figure: predictive analytics, comprising predictive modelling, data mining, text analytics, and web analytics]
Data mining is defined as the nontrivial process of identifying valid, novel, potentially
useful, and ultimately understandable patterns in data stored in structured databases:
►process – implies that data mining comprises many iterative steps
►nontrivial – means that some experimentation-type search or inference is involved
►valid – means that the discovered patterns should hold true on new data with a
sufficient degree of certainty
►novel – means that the patterns are not previously known to the user within the context
of the system being analysed
►potentially useful – means that the discovered patterns should lead to some benefit to
the user or task
►ultimately understandable – means that the pattern should make business sense, leading
the user to say “it makes sense”
Data mining is a blend of multiple disciplines, including: statistics, artificial intelligence,
machine learning and pattern recognition, information visualisation, database
management and data warehousing, and management science and information
systems.
@study_ingmadesimple Luca du Toit
FACTORS THAT HAVE INCREASED THE POPULARITY OF DATA MINING:
►more intense competition on a global scale, driven by customers’ ever-changing needs
and wants in an increasingly saturated marketplace
►general recognition of the untapped value hidden in large data sources
►consolidation and integration of database records, which enables a single view of
customers, vendors, transactions, etc.
►consolidation of databases and other data repositories into a single location in the form
of a data warehouse
►the exponential increase in data processing and storage technologies
►significant reduction in the cost of hardware and software for data storage and
processing
►movement toward the demassification of business practices – the conversion of
information resources into nonphysical form
A company that effectively uses data mining tools and techniques can acquire and
maintain a strategic competitive advantage. Data mining offers organisations an
indispensable decision-enhancing environment to exploit new opportunities by
transforming data into a strategic weapon.
HOW DOES DATA MINING WORK?
Using existing and relevant data obtained from within and outside the organisation, data
mining builds models to discover patterns among the attributes presented in the data set.
Models are the mathematical representations (linear or nonlinear relationships) that
identify the patterns among the attributes of the things described within the dataset.
Data mining seeks to identify the following major types of patterns:
►associations – finds commonly co-occurring groupings of things, e.g. beer and diapers
(the claim that men who go to a store to buy diapers are also likely to buy beer)
►predictions – this tells the nature of future occurrences of certain events based on what
has happened in the past e.g. forecasting the max temperature of a particular day
►clusters – this identifies natural groupings of things based on their known characteristics
e.g. assigning customers in different segments based on their demographics and past
purchase behaviours
►sequential relationships – this discovers time-ordered events e.g. predicting that an
existing banking customer who already has a checking account will open a savings
account followed by an investment account within a year
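The association pattern above can be made concrete with a small calculation. The sketch below is illustrative only: the transactions are hypothetical, and the two measures used (support and confidence) are standard ways of scoring an association rule that these notes do not name themselves.

```python
# Hypothetical market-basket data: each set is one customer's transaction.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

# Score the rule "diapers -> beer" on the toy data.
rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A rule with high support and high confidence is one that co-occurs often and holds for most transactions containing the first item.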
Data mining tasks can be classified into three main categories: prediction, association,
and clustering. Based on the way in which the patterns are extracted from the historical
data, the learning algorithms of data mining methods can be classified as either
supervised or unsupervised.
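Clustering is a typical unsupervised task: the algorithm receives no labelled outcomes and must find the groupings itself, whereas a supervised method learns from examples whose outcome is already known. The sketch below uses k-means, a common clustering algorithm chosen here for illustration and not named in these notes. The customer spend figures and the starting centroids are hypothetical.

```python
# Hypothetical data: annual spend (in thousands) for eight customers, with no labels.
spend = [1.0, 1.5, 2.0, 2.5, 8.0, 8.5, 9.0, 9.5]

def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on 1-D data.

    Repeatedly assign each value to its nearest centroid, then move each
    centroid to the mean of the values assigned to it.
    """
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Starting centroids are an arbitrary assumption; k-means results can depend on them.
centroids, clusters = kmeans_1d(spend, centroids=[1.0, 2.0])
print(centroids)
print(clusters)
```

On this toy data the two groups found correspond to low-spend and high-spend customers, even though no segment labels were supplied.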
We must first determine what we want to do: are we trying to predict something or
describe something?
REGRESSION is a way of mathematically determining which variables have an impact on an
outcome and which do not. It is the process of finding a model or function that maps the
data to continuous real values rather than to discrete classes. It is a data mining technique
used to predict numeric values from a given dataset. It analyses the relationship
between a target (dependent) variable and one or more predictor (independent) variables.
Example: one could use regression to predict what temperature it will be on a particular
day, such as 28° (regression uses numeric values).
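As a minimal sketch of the temperature example, the code below fits a straight line by ordinary least squares and uses it to predict a numeric value. The observations are hypothetical, and simple linear regression with a single predictor is only one of many regression methods.

```python
# Hypothetical observations: day number vs. maximum temperature in degrees.
days = [1, 2, 3, 4, 5]
max_temps = [22.0, 23.5, 25.0, 26.5, 28.0]

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(days, max_temps)

# Predict the target (dependent) variable for a new value of the predictor.
predicted = slope * 6 + intercept
print(f"temp = {slope:.2f} * day + {intercept:.2f}; day 6 -> {predicted:.1f}")
```

Here the day number is the predictor (independent) variable and the maximum temperature is the target (dependent) variable; the output is a continuous number, not a class label.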