Data Science Analytics For Ordinary People PDF
Jeffrey Strickland
ISBN 978-1-329-28062-5
Data Science and Analytics for Ordinary People
By
Jeffrey S. Strickland
Data Science and Analytics for Ordinary People
All rights reserved. This book or any portion thereof may not be
reproduced or used in any manner whatsoever without the express
written permission of the publisher except for the use of brief quotations
in a book review.
ISBN 978-1-329-28062-5
Criticism ................................................................................................... 18
Tools ........................................................................................................ 20
References ............................................................................................... 20
A DANGEROUS GAME WE PLAY .......................................................................... 22
What are our assumptions? .................................................................... 22
Can we really predict the Unpredictable?................................................ 23
What are models really?.......................................................................... 23
Can models be “good enough”? .............................................................. 24
Did we build the “Right” Model? .............................................................. 25
Is it still dangerous?.................................................................................. 26
PREDICTIVE ANALYTICS: THE SKILLS AND TOOLS ........................................................ 27
What is Predictive Analytics? ................................................................... 27
Statistical Modeling & Tools..................................................................... 28
Data Processing ........................................................................................ 29
PART III – MODELS AND MODELS AND MORE MODELS................................ 30
HOW DO YOU BUILD A MODEL? ............................................................................ 30
Define the problem ................................................................................... 31
Define the Business case .......................................................................... 31
Define the Model Objective ...................................................................... 31
Determine the requirements .................................................................... 31
Gather the Data ....................................................................................... 32
Process the Data....................................................................................... 32
Build the Model ........................................................................................ 32
Interpret the Model Results ...................................................................... 32
Validate the Model for Production ........................................................... 33
Perform an Economic Analysis ................................................................. 33
Present the Results ................................................................................... 33
Follow-up .................................................................................................. 33
Conclusion ................................................................................................ 34
WHAT ARE PREDICTIVE MODELS ANYWAY? ............................................................ 35
How old is Data Science? .......................................................................... 35
What is Predictive Modeling?................................................................... 36
What is a Predictive Model?..................................................................... 36
Examples of Predictive Models ................................................................. 37
What Predictive Models have I Built?....................................................... 40
References ................................................................................................ 41
MATHEMATICAL MODELING ............................................................................... 42
Conclusion................................................................................................ 43
IF YOU BUILD IT THEY WILL BUY... ....................................................................... 44
A Field of Dreams..................................................................................... 44
Rocker Boxes ............................................................................................ 45
Propensity Models ................................................................................... 46
UPLIFT MODELS IN PLAIN ENGLISH, SORT OF... ....................................................... 48
WHAT’S THE DIFFERENCE? ................................................................................. 50
Descriptive models ................................................................................... 50
Predictive models..................................................................................... 50
Prescriptive models .................................................................................. 51
Analytics .................................................................................................. 51
Terms ....................................................................................................... 51
WHAT IS A PROPENSITY MODEL?......................................................................... 52
Model 1: Predicted customer lifetime value ............................................ 53
Model 2: Predicted share of wallet .......................................................... 53
Model 3: Propensity to engage ................................................................ 53
Model 4: Propensity to unsubscribe ........................................................ 54
Model 5: Propensity to buy ...................................................................... 54
Model 6: Propensity to churn .................................................................. 55
Conclusion................................................................................................ 55
WHAT ARE CLUSTERING MODELS? ....................................................................... 56
Predictive model 1: Behavioral clustering ............................................... 56
Predictive model 2: Product based clustering (also called category based
clustering) ................................................................................................ 57
Predictive model 3: Brand based clustering ............................................ 58
WHAT IS A SIMULATION MODEL? ........................................................................ 59
Monte Carlo Simulation Model ................................................................ 60
Dynamic Simulation Models ..................................................................... 60
Discrete Time Simulation Models ............................................................. 61
Discrete Event Simulation Models ............................................................ 61
Simulation Architectures .......................................................................... 62
Conclusion ................................................................................................ 62
WHAT ARE STOCHASTIC MODELS? ......................................................... 63
Is there a chance?..................................................................................... 63
Stochastic ................................................................................................. 63
Deterministic ............................................................................................ 64
References ................................................................................................ 64
WHAT ARE NEURAL NETWORKS? ......................................................................... 65
Are you neural? ........................................................................................ 65
References ................................................................................................ 68
WHAT IS DISCRETE EVENT SIMULATION? ............................................................... 70
WHAT IS PREDICTIVE ANALYTICS MISSING? ............................................................ 72
Advantages of using simulation ............................................................... 72
What are the Insurance and Financial industries missing? ...................... 73
What could they do differently? ............................................................... 73
We are doing just fine .............................................................................. 74
The One-eyed Man in the Kingdom of the Blind ......................................... 74
Break the rules! ........................................................................................ 75
A little at a time ........................................................................................ 75
Will it crash? ............................................................................................. 75
Conclusion ................................................................................................ 76
PART IV – STATISTICAL MATTERS ................................................... 77
WHY YOU SHOULD CARE ABOUT STATISTICS ............................................................ 77
Who is Benjamin Disraeli? ........................................................................ 77
Can we lie with statistics? ................................................................ 78
Can we lie with data? .............................................................................. 78
Can we lie with statistical models? .......................................................... 79
Can we be confident in models? .............................................................. 79
Conclusion................................................................................................ 80
WHY IS ANALYSIS LIKE HIKING? ........................................................................... 81
Planning ................................................................................................... 81
Execution ................................................................................................. 82
Relating your Story .................................................................................. 83
Conclusion................................................................................................ 83
12 WAYS NOT TO PLEASE YOUR CUSTOMER .......................................................... 85
ANALYTICS AND STATISTICS: IS THERE A DIFFERENCE?............................................... 89
What is Statistics? ................................................................................... 89
What is Analytics? ................................................................................... 90
Conclusion................................................................................................ 91
References ............................................................................................... 92
ARE STATISTICIANS A DYING BREED? .................................................................... 93
STATISTICS IS OBSOLETE ..................................................................................... 96
Why do we no longer need Statistics? ..................................................... 96
What’s wrong with this Picture? ............................................................. 97
How do we fix the Picture? ...................................................................... 99
MATH, PHYSICS AND CHEMISTRY ARE OBSOLETE................................................... 100
PART V – DATA SCIENCE CONCERNS........................................................... 102
DATA SCIENTISTS ARE DEAD, LONG LIVE DATA SCIENCE! ........................................ 102
Call it what it is ...................................................................................... 102
Mimic Mathematical Sciences? ............................................................. 103
A Data Science Taxonomy?.................................................................... 104
Fun with Machine Learning ................................................................... 104
Conclusion.............................................................................................. 105
SO YOU THINK YOU ARE A DATA SCIENTIST? .......................................................... 106
Assistant Scientist I/II ............................................................................. 108
Scientist I/II ............................................................................................. 108
Senior Scientist I/II .................................................................................. 109
Principal Scientist I/II .............................................................................. 109
5 SIGNS THAT YOU MIGHT BE A DATA SCIENTIST .................................................... 111
HOW CAN I BE A DATA SCIENTIST? ...................................................................... 113
Johns Hopkins University – Data Science ....................................................... 113
University of Illinois at Urbana-Champaign – Data Mining ............................ 113
SAS and Hadoop for Learning ......................................................................... 114
The author would like to thank colleagues Adam Wright, Adam Miller,
Matt Santoni, and Olaf Larson of Clarity Solution Group. Working with
them over the past two years has validated the concepts presented
herein.
Preface
Ordinary people include anyone who is not a geek like myself. This
book is written for ordinary people. That includes managers, marketers,
technical writers, couch potatoes, and so on.
The book is organized into nine logical sections: Big Data, Analytics,
Models, Statistical Matters, Data Science Concerns, Applications,
Operations Research, Tools and Advice.
PART I – Big Data
Big data: The next frontier for innovation, competition, and
productivity.
The term “big data” was included in the most recent quarterly online
update of the Oxford English Dictionary (OED). So now we have a most
authoritative definition of what recently became big news: “data of a
very large size, typically to the extent that its manipulation and
management present significant logistical challenges.” The term,
however, may have appeared as early as 1944. [1]
Background
Predictive modeling and analytics are getting a lot of attention these
days, perhaps in light of corporate America’s “discovery” of Big Data.
Well, I have news for you: “big data” has been around longer than you
think, and corporate America did not “discover” it.
scenarios included more than one threat missile. I think we once had a
data file that was on the order of 50 terabytes.
Surveillance data from space also exceeds most people’s concept of big
data, and we have been collecting it since the Cold War. I once had to
write an algorithm that would take the enormous amount of data and
filter it (along with some data reduction) using things like value of
information (VOI), or else it would over-saturate our networks and be
useless.
Conclusion
The point is, “bigger data” has been around a lot longer than “big data”.
The growth of data corresponds with the growth in storage capacity,
which corresponds to the growth in technology, and if NASA had not
raced to get men on the moon, we might still be drawing histograms on
graph paper.
Works Cited
1. Gil Press, “A Very Short History Of Big Data”, Forbes, May 9, 2013
2. Strickland, J. (2011). Using Math to Defeat the Enemy: Combat
Modeling for Simulation. Lulu.com. ISBN 978-1-257-83225-23
BIG DATA, small data, Horses and
Unicorns
Every so often a term becomes so beloved by media that it moves from
‘instructive’ to ‘hackneyed’ to ‘worthless,’ and Big Data is one of those
terms….
— Roger Ehrenberg
Somehow, I missed the funeral, but BIG DATA is dead. The phrase “big
data” now has no value ... it's completely void of meaning. For those of
us who have been around long enough, the mere mention of the phrase
is awful enough to induce a big data migraine — please pass the big data
Tylenol, no not strong enough, the big data Vicodin.
The end came quickly…
After the horse flies finished their work, the worms came and the horses
got sicker. I do not have anything against vendors, I mean worms, but
they wouldn’t let a downed horse recover. Then one day the horses
died…all of them. We were not left with ponies—they had been
forgotten. So it would seem that we were left with no horses of any kind.
Part II - Analytics
Types
Business analytics comprises descriptive, predictive, and
prescriptive analytics; these are generally understood to be
descriptive modeling, predictive modeling, and prescriptive
modeling.
Descriptive analytics
Predictive analytics
Prescriptive analytics
advantage of a future opportunity or mitigate a future risk and
shows the implication of each decision option. Prescriptive
analytics can continually take in new data to re-predict and re-
prescribe, thus automatically improving prediction accuracy and
prescribing better decision options. Prescriptive analytics ingests
hybrid data, a combination of structured (numbers, categories) and
unstructured data (videos, images, sounds, texts), and business
rules to predict what lies ahead and to prescribe how to take
advantage of this predicted future without compromising other
priorities.
Applications
Although predictive analytics can be put to use in many
applications, I list a few here:
Predictive Analytics & Modeling
Predictive analytics—sometimes used synonymously with predictive
modeling—is not synonymous with statistics, often requiring
modification of functional forms and use of ad hoc procedures, making
it a part of data science to some degree. It does, however, encompass
a variety of statistical techniques for modeling, incorporates machine
learning, and utilizes data mining to analyze current and historical facts,
making predictions about the future.
travel (McDonald, 2010), healthcare (Stevenson, 2011), pharmaceuticals
(McKay, 2009), defense (Strickland, 2011) and other fields.
Definition
Predictive analytics is an area of Data Science that deals with extracting
information from data and using it to predict trends and behavior
patterns. Often the unknown events of interest are in the future, but
predictive analytics can be applied to any type of unknown, whether it
be in the past, present, or future: for example, identifying suspects after
a crime has been committed, or credit card fraud as it occurs (Strickland
J., 2013). The core of predictive analytics relies on capturing
relationships between explanatory variables and the predicted variables
from past occurrences, and exploiting them to predict the unknown
outcome. It is important to note, however, that the accuracy and
usability of results will depend greatly on the level of data analysis and
the quality of assumptions.
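The core idea of capturing the relationship between an explanatory variable and a predicted variable from past occurrences, then exploiting it to predict an unknown outcome, can be sketched with a one-variable least-squares fit (a minimal Python illustration; all numbers are invented):

```python
# Fit a one-variable linear model y = a + b*x from past occurrences,
# then use it to predict an unseen case. Data are illustrative only.
past_x = [1.0, 2.0, 3.0, 4.0, 5.0]   # explanatory variable
past_y = [2.1, 3.9, 6.2, 8.1, 9.9]   # predicted variable (observed outcomes)

n = len(past_x)
mean_x = sum(past_x) / n
mean_y = sum(past_y) / n

# Least-squares slope and intercept
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(past_x, past_y)) \
    / sum((x - mean_x) ** 2 for x in past_x)
a = mean_y - b * mean_x

# Exploit the captured relationship to predict an unknown outcome
x_new = 6.0
prediction = a + b * x_new
print(round(prediction, 2))
```

As the chapter notes, the usefulness of such a prediction depends entirely on the quality of the past data and the assumption that the captured relationship still holds.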
Not Statistics
Predictive analytics uses statistical methods, but also machine learning
algorithms and heuristics. Though statistical methods are important,
the Analytics professional cannot always follow the “rules of statistics to
the letter.” Instead, the analyst often implements what I call “modeler
judgment”. Unlike the statistician, the analytics professional—akin to
the operations research analyst—must understand the system,
business, or enterprise where the problem lies, and in the context of the
business processes, rules, operating procedures, budget, and so on,
make judgments about the analytical solution subject to various
constraints. This requires a certain degree of creativity, and lends itself
to being both a science and an art.
Types
Generally, the term predictive analytics is used to mean predictive
modeling, “scoring” data with predictive models, and forecasting.
However, people are increasingly using the term to refer to related
analytical disciplines, such as descriptive modeling and decision
modeling or optimization. These disciplines also involve rigorous data
analysis, and are widely used in business for segmentation and decision
making, but have different purposes and the statistical techniques
underlying them vary.
Predictive models
referred to as “out of [training] sample” units. The out-of-sample units
bear no chronological relation to the training sample units. For example,
the training sample may consist of literary attributes of writings by
Victorian authors with known attribution, and the out-of-sample unit
may be a newly found writing of unknown authorship; a predictive
model may aid in attributing the unknown writing to an author. Another
example is given by analysis of blood splatter in simulated crime scenes,
in which the out-of-sample unit is the actual blood splatter pattern from
a crime scene. The out-of-sample unit may be from the same time as the
training units, from a previous time, or from a future time.
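The training-sample versus out-of-sample distinction can be made concrete with a toy propensity "model" that learns response rates from training units with known outcomes and then scores an out-of-sample unit whose outcome is unknown (Python; the segments and outcomes are invented):

```python
from collections import defaultdict

# Training units: (segment, responded) pairs with KNOWN outcomes.
training = [
    ("young", 1), ("young", 1), ("young", 0),
    ("senior", 0), ("senior", 1), ("senior", 0),
]

# "Model": response rate per segment, learned only from the training sample.
counts = defaultdict(lambda: [0, 0])  # segment -> [responses, total]
for segment, responded in training:
    counts[segment][0] += responded
    counts[segment][1] += 1
model = {seg: resp / total for seg, (resp, total) in counts.items()}

# Out-of-sample unit: its outcome is unknown; the model supplies a prediction.
out_of_sample = "young"
print(model[out_of_sample])  # predicted propensity for the unseen unit
```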
Descriptive models
Decision models
Applications
Although predictive analytics can be put to use in many applications, we
outline a few examples where predictive analytics has shown positive
impact in recent years.
Collection analytics
Every portfolio has a set of delinquent customers who do not make their
payments on time. The financial institution has to undertake collection
activities on these customers to recover the amounts due. A lot of
collection resources are wasted on customers who are difficult or
impossible to recover. Predictive analytics can help optimize the
allocation of collection resources by identifying the most effective
collection agencies, contact strategies, legal actions, and other strategies
for each customer, thus significantly increasing recovery while
reducing collection costs.
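The allocation idea can be sketched by ranking delinquent accounts by expected recovery, the product of a model-predicted recovery probability and the balance due (Python; the customers, balances, and probabilities are invented):

```python
# Delinquent accounts: (customer, balance_due, predicted_recovery_probability).
# The probabilities would come from a predictive model; values are invented.
accounts = [
    ("A", 1200.0, 0.10),
    ("B",  300.0, 0.80),
    ("C",  900.0, 0.50),
]

# Rank by expected recovery so scarce collection resources go where
# they are most likely to pay off.
ranked = sorted(accounts, key=lambda a: a[1] * a[2], reverse=True)
for customer, balance, prob in ranked:
    print(customer, balance * prob)
```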
Cross-sell
Customer retention
Predictive analytics can also predict this behavior, so that the company
can take proper actions to increase customer activity.
Direct marketing
Fraud detection
Fraud is a big problem for many businesses and can be of various types:
inaccurate credit applications, fraudulent transactions (both offline and
online), identity thefts and false insurance claims. These problems
plague firms of all sizes in many industries. Some examples of likely
victims are credit card issuers, insurance companies (Schiff, 2012), retail
merchants, manufacturers, business-to-business suppliers and even
services providers. A predictive model can help weed out the “bads” and
reduce a business's exposure to fraud.
The Internal Revenue Service (IRS) of the United States also uses
predictive analytics to mine tax returns and identify tax fraud (Schiff,
2012).
Recent advancements in technology have also introduced predictive
behavior analysis for web fraud detection. This type of solution utilizes
heuristics in order to study normal web user behavior and detect
anomalies indicating fraud attempts.
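A heuristic of this kind can be as simple as flagging behavior that deviates far from the learned norm, for example by a z-score threshold (a minimal sketch in Python; the session rates and threshold are invented):

```python
import statistics

# Requests per minute for a population of "normal" sessions (invented data).
normal_rates = [12, 15, 11, 14, 13, 12, 16, 14]
mu = statistics.mean(normal_rates)
sigma = statistics.stdev(normal_rates)

def is_anomalous(rate, threshold=3.0):
    """Flag a session whose behavior deviates far from the learned norm."""
    return abs(rate - mu) / sigma > threshold

print(is_anomalous(13))    # a typical session
print(is_anomalous(400))   # behavior consistent with an automated fraud attempt
```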
Often the focus of analysis is not the consumer but the product,
portfolio, firm, industry or even the economy. For example, a retailer
might be interested in predicting store-level demand for inventory
management purposes. Or the Federal Reserve Board might be
interested in predicting the unemployment rate for the next year. These
types of problems can be addressed by predictive analytics using time
series techniques (see Chapter 18). They can also be addressed via
machine learning approaches which transform the original time series
into a feature vector space, where the learning algorithm finds patterns
that have predictive power.
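The transformation of a time series into a feature vector space can be sketched with lagged windows: each vector of recent observations becomes a feature row, and the following value becomes the target a learning algorithm tries to predict (Python; the series is invented):

```python
# Turn a time series into (lag-vector, next-value) training pairs so a
# learning algorithm can search for predictive patterns.
series = [5, 7, 6, 8, 9, 11, 10]
window = 3

features, targets = [], []
for i in range(len(series) - window):
    features.append(series[i:i + window])  # the last `window` observations
    targets.append(series[i + window])     # the value to predict

print(features[0], targets[0])
```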
Risk management
Underwriting
Many businesses have to account for risk exposure due to their different
services and determine the cost needed to cover the risk. For example,
auto insurance providers need to accurately determine the amount of
premium to charge to cover each automobile and driver. A financial
company needs to assess a borrower's potential and ability to pay before
granting a loan. For a health insurance provider, predictive analytics can
analyze a few years of past medical claims data, as well as lab, pharmacy
and other records where available, to predict how expensive an enrollee
is likely to be in the future. Predictive analytics can help underwrite
these quantities by predicting the chances of illness, default, bankruptcy,
etc. Predictive analytics can streamline the process of customer
acquisition by predicting the future risk behavior of a customer using
application level data. Predictive analytics in the form of credit scores
have reduced the amount of time it takes for loan approvals, especially
in the mortgage market where lending decisions are now made in a
matter of hours rather than days or even weeks. Proper predictive
analytics can lead to proper pricing decisions, which can help mitigate
future risk of default.
Analytical Techniques
The approaches and techniques used to conduct predictive analytics can
broadly be grouped into regression techniques and machine learning
techniques. [Condensed]
Regression techniques
Neural networks
Multilayer Perceptron (MLP)
Radial basis function (RBF)
Naïve Bayes
K-Nearest Neighbor algorithm (k-NN)
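The discussion of these techniques was condensed, but as a concrete illustration of one of them, here is a minimal k-nearest neighbor classifier (Python; the training points and labels are invented):

```python
import math
from collections import Counter

def knn_predict(train, point, k=3):
    """Classify `point` by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda t: math.dist(t[0], point))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

# Toy two-class training set: ((x, y), label). Values are invented.
train = [((1, 1), "buy"), ((1, 2), "buy"), ((2, 1), "buy"),
         ((8, 8), "no"), ((8, 9), "no"), ((9, 8), "no")]

print(knn_predict(train, (2, 2)))  # the nearest neighbors all carry "buy"
```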
Criticism
There are plenty of skeptics when it comes to the ability of computers
and algorithms to predict the future, including Gary King, a professor at
Harvard University and the director of the Institute for Quantitative
Social Science. People are influenced by their environment in
innumerable ways. Trying to understand what people will do next
assumes that all the influential variables can be known and measured
accurately. “People’s environments change even more quickly than they
themselves do. Everything from the weather to their relationship with
their mother can change the way people think and act. All of those
variables are unpredictable. How they will impact a person is even less
predictable. If put in the exact same situation tomorrow, they may make
a completely different decision. This means that a statistical prediction
is only valid in sterile laboratory conditions, which suddenly isn't as
useful as it seemed before.” (King, 2014)
Tools
Tools change often, but SAS appears to be the industry standard, and I
rely heavily on SAS Enterprise Modeler for my job. Be that as it may, I
use R a great deal and find SPSS (particularly SPSS Modeler) useful for
some things. Personally, I prefer R.
References
Barkin, E. (2011). CRM + Predictive Analytics: Why It All Adds Up. New
York: Destination CRM. Retrieved 2014, from
http://www.destinationcrm.com/Articles/Editorial/Magazine-
Features/CRM---Predictive-Analytics-Why-It-All-Adds-Up-
74700.aspx
Strickland, J. (2013). Introduction to Crime Analysis and Mapping.
Lulu.com. Retrieved from http://www.lulu.com/shop/jeffrey-
strickland/introduction-to-crime-analysis-and-
mapping/paperback/product-21628219.html
A Dangerous Game We Play
We, the scientists who perform predictive modeling, hold a crystal ball
in our hands, making concise and accurate predictions based on quality
data and scientific procedures. Or, do we?
Perhaps this is why Professor Box said, “all models are wrong, but some
are useful.”
Well, first, the models are not “bad”. They are just models. Modeling in
general is to pretend that one deals with a real thing while really working
with an imitation. In operations research the imitation is a computer
model of the simulated reality. A flight simulator on a PC is also a
computer model of some aspects of the flight: it shows on the screen
the controls and what the “pilot” (the youngster who operates it) is
supposed to see from the “cockpit” (his armchair). A statistical model
predicting who has the propensity to buy a certain product, or a machine
learning algorithm with the same objective, is likewise a model.
The question we often ask is, “How good is your model?” What we
should ask is, “Is your model's performance better than no model at all?”
The answer to the former is probably, “Not very good.” The answer to
the latter is “Probably better.” The graph below represents the
performance of a statistical model with predicted values, actual values
and random values (or no model at all). Assuming the two curves above
the diagonal line (representing no model) are statistically different from
random chance, then the model would be better than having no model.
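The comparison against the no-model diagonal can be sketched as a cumulative gains computation: rank customers by model score, then measure what share of actual responders each slice captures (Python; the scores and outcomes are invented):

```python
# Each pair is (model_score, actual_outcome), outcome 1 = responder.
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0),
          (0.4, 0), (0.3, 1), (0.2, 0), (0.1, 0), (0.05, 0)]

scored.sort(key=lambda s: s[0], reverse=True)  # rank by model score
total_responders = sum(outcome for _, outcome in scored)

captured = 0
gains = []  # cumulative share of responders captured at each depth
for _, outcome in scored:
    captured += outcome
    gains.append(captured / total_responders)

# With no model (the diagonal), the top 30% would capture ~30% of
# responders; a useful model's curve sits above that line.
print(gains[2])  # share captured in the top 30% (3 of 10 customers)
```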
Did we build the “Right” Model?
Clearly a model in this instance is better than no model. But this requires
qualification: is the model valid? This is different from asking "is the
model a good one". It is not a matter of whether we built the model right,
although that is equally important. No, the question is: “Did we build the
right model?" If our statistical implementations were correct, then the
model depicted in the figure would seem to be a good model. It captures
about 70% of the audience at the 7th pentile. It is relatively smooth and
is about 2.5 times better than no model at the 7th pentile. The question
remains: is it the right model?
Would I use this model to predict? Yes, and I already do so. It is not a
perfect model, but it is a useful one. The model addresses the client’s
business case, and is used to predict the propensity of a customer to
engage in action X (I cannot elaborate on the client or model any more
than this). Yet, though it is used for prediction, it will still occasionally
“miss the mark” and be wrong, due to inherent statistical errors and to
data that may not be “perfect.” So the equation for a useful model might
look like:
Is it still dangerous?
Thus, prediction is a dangerous game we play. We do not really know for
certain what will occur in the months to come, but an educated guess—
based on a model—is probably much better than a random shot in the
dark. Our crystal ball may not work so well after all, but we can still dimly
see what might come to pass.
Predictive Analytics: the skills and
tools
Several weeks ago I posted an article called “What is Predictive
Analytics?”, describing what it is. In the present article, I want to talk
about the skills and tools that one should have to perform predictive analytics.
The tools one could use are myriad, and are often the tools our
company or customer has already deployed. SAS modeling products are
well-established tools of the trade. These include SAS Statistics, SAS
Enterprise Guide, SAS Enterprise Modeler, and others. IBM made its mark
on the market with the purchase of Clementine and its repackaging as
IBM SPSS Modeler. There are other commercial products like Tableau. I
have to mention Excel here, for it is all many will have to work with. But
you have to go beyond the basics and into its data tools, statistical
analysis tools and perhaps its linear programming Solver, plus be able to
construct pivot tables, and so on.
Today, there are a multitude of open-source tools that have become
popular, including R and its GUI, R-Studio; the S programming package;
and the Python programming language (the most used language in
2014). R, for example, is every bit as good as its nemesis SAS, but I have
yet to get it to leverage the enormous amount of data that I have with
SAS. Part of this is due to server capacity and allocation, so I really don't
know how much data R can handle.
Data Processing
For the foregoing methods, data is necessary and it will probably not be
handed to you on a silver platter ready for consumption. It may be
“dirty”, in the wrong format, incomplete, or just not right. Since this is
where you may spend an abundant amount of time, you need the skills
and tools to process data. Even if this is a secondary task--it has not been
for me--you will probably need to know Structured Query Language
(SQL) and something about the structure of databases.
If you do not have clean, complete, and reliable data to model with, you
are doomed. You may have to remove inconsistencies, impute missing
values, and so on. Then you have to analyze the data, perform data
reduction, and integrate the data so that it is ready for use. Modeling
with “bad” data results in a “bad” model!
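Much of this work reduces to small, mechanical steps. Here is a minimal sketch of one of them, mean imputation, in Python; the records and field names are hypothetical:

```python
# A minimal sketch of one cleaning step, mean imputation: a missing age
# (None) is replaced with the mean of the observed ages.
def impute_mean(records, field):
    observed = [r[field] for r in records if r[field] is not None]
    mean = sum(observed) / len(observed)
    return [dict(r, **{field: r[field] if r[field] is not None else mean})
            for r in records]

customers = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},   # the missing value to impute
    {"id": 3, "age": 46},
]
clean = impute_mean(customers, "age")  # customer 2 gets age 40.0
```

Real projects would lean on a database or a package for this, but the principle is the same: decide, explicitly, what to do with every inconsistency before the data reaches the model.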
Part III – Models and Models
and More Models
If I had an hour to solve a problem I'd spend 55 minutes thinking about
the problem and 5 minutes thinking about solutions
—Albert Einstein
Define the problem
Defining the problem is necessary. Defining the “right” problem is absolutely critical; otherwise, you're just wasting everyone's time. However, the customers you are working for may not know how to express the real problem or may not know what the problem really is. The Operations Research (OR) analyst must ask the right questions and draw the real problem out from where it may be hiding.
Gather the Data
If data is necessary, it will probably not be handed to you on a silver platter, ready for consumption. It may be “dirty”, in the wrong format, incomplete, or just not right. If you have fully identified the requirements, you will already know what has to be accomplished at this stage.
that fly in the face of conventional wisdom. The validation step is an
extremely important piece of the puzzle whereby a company may have
to perform a series of controlled experiments to confirm for sure that “x
contributes to the explanation of y”. This leap of faith extends to
potentially creating the necessary data in the first place if it doesn’t
already exist and may involve short-term losses for the promise of a
much larger long-term gain.
Follow-up
Too often we fix the problem and go back to our lives or onto another
problem, but following up with the customer to evaluate the
effectiveness of the solution you provided is just as important as the
solution that you provided. It is especially important if you want to ever
see that customer again.
Conclusion
So there is my view of model building. I have built just about every kind
of model: linear programs, nonlinear programs, machine learning
models, statistical models, Markov transition models, and so on. I have
built models of unmanned aerial vehicles, space launch vehicles,
communications networks, software systems, propensity to buy a
product, call center operations, combat phenomena, and so on. And I
have, more than I would like to admit, built the wrong model. Hopefully,
this will help the modeler step out on the right foot, and help the model
recipient know what to look for when asking for a modeling solution.
What are Predictive Models
Anyway?
Predictive modeling does not lie solely in the domain of Big Data Analytics or Data Science. I am sure that there are a few “data scientists” who think they invented predictive modeling. However, predictive modeling has existed for a while, at least since World War II. In simple terms, a predictive model is a model with some predictive power. I will elaborate on this later.
I have been building predictive models since 1990. Doing the math, 2015 – 1990 = 25 years, I have been engaged in the predictive modeling business longer than data science has been around. My first book on the subject, "Fundamentals of Combat Modeling" (2007), predates the "Data Science" of 2009 (see below).
ground, but the topic reemerges in 2001 when William S. Cleveland
publishes “Data Science: An Action Plan for Expanding the Technical
Areas of the Field of Statistics.” But it is really not until 2009 that data
science gains any significant following and that is also the year that Troy
Sadkowsky created the data scientists group on LinkedIn as a companion
to his website, datasceintists.com (which later became
datascientists.net). [1]
Examples of Predictive Models
The taxonomy of predictive models presented here is neither exhaustive nor exclusive. In other words, there are other ways to classify predictive models, but here is one.
Physical models. These models are based on physical phenomena. They
include 6-DoF (Degrees of Freedom) flight models, space flight models,
missile models, combat attrition models (based on physical properties
of munitions and equipment).
Machine Learning Models. These include artificial neural networks (ANNs), support vector machines, classification trees, random forests, etc. These are based on data, but unlike statistical models, they “learn” from the data.
Weather models. These are forecasting models based on data, but the amount of data, the short interval of prediction windows, and the physical phenomena involved make them much different from statistical forecasting models.
Statistical Models. The first two examples, Time Series and Regression models, are statistical models. However, I list them separately because many do not realize that statistical models are mathematical models, based on
mathematical statistics. Things like means and standard deviations are
statistical moments, derived from mathematical moment generating
functions. Every statistic in Statistics is based on a mathematical
function.
Models I have consulted on include the NASA Ares I Crew Launch Vehicle
Reliability and Launch Availability; The Extended Range Multi-Purpose
(ERMP) Unmanned Aerial Vehicle RAM Model, The Future Combat
Systems (FCS) C4ISR family of models; FCS Logistic Decision Support
System Test-Bed Model; Unspecified models (unspecified because they
are classified).
References
Press, G. “A Very Short History of Data Science”, Forbes, May 28, 2013. Retrieved 05-29-2015.
Mathematical Modeling
Again, there are prerequisites like differential and integral calculus and
linear algebra. Multivariate calculus is a plus, particularly if you'll be
doing models involving differential equations and nonlinear
optimization. The skills you need to acquire beyond the basics include
mathematical programming--linear, integer, mixed, and nonlinear. Goal
programming, game theory, Markov chains, and queuing theory, to
name a few, may be required. Mathematical studies in real and complex
analysis, and linear vector spaces, as well as abstract algebraic concepts
like group, fields and rings, can reveal the foundational theory.
and Analytica. Octave is an open-source mathematical modeling tool that reads MATLAB code, and there are add-on GUI environments (like RStudio for R) floating around in cyberspace. I recently discovered the power of Scilab and the world of modules (packages) that are available for this open-source gem.
For simulation, Simulink works "on top of" MATLAB functions/code for a
variety of simulation models. I wrote the book "Missile Flight
Simulation", using MATLAB and Simulink. ExtendSim is an excellent tool
for discrete event simulation and the subject of my book "Discrete Event
Simulation using ExtendSim". In Scilab, I have used Xcos for discrete
event simulation and Quapro for linear programming. Both are featured
in my next book.
There is a general analytics tool that I do not know much about yet: BOARD, which in its newest release boasts a predictive analytics capability. I will be speaking on predictive analytics at the BOARD User Conference, April 13th-14th in San Diego. Again, I would be remiss not to mention Excel, and particularly the Solver add-in for mathematical programming. Another third-party add-in to consider is @Risk.
Conclusion
If you aspire to become an analytics consultant or scientist, you have a
lot of open-source tools, free training and online tutorials at your
fingertips. If you are already working in analytics, you can easily
specialize in predictive analytics. If you are already working in predictive
analytics, you have what you need to become an expert. All of the tools
will either work with your PC's native processing power or through a
virtual machine, for example, when using Hadoop, or remote server.
If You Build It They Will Buy...
Most everyone knows that using a propensity model to predict
customers’ likelihood of buying, engaging, leaving and so on, produces a
larger target audience. That is too bad... for what "everyone knows" is
actually incorrect.
A Field of Dreams
In the movie Field of Dreams, actor Kevin Costner (whom people often confuse me with) plays an Iowa corn farmer struggling to pay the mortgage on his homestead. He is out farming one day and hears a voice, like a loud whisper in the wind: “If you build it, they will come.”
Models are not like that. In fact, they reduce the size of the customer population that has a high propensity to do whatever it is that you are modeling. If you heard a voice in this context, it would say: “If you build it, fewer will come.”
Wow, that is really bad! So why do we use models?
Actually, it is not bad—that is what we want models to do. But you must think I am out of my mind for saying so. However, modeling the propensity of customers to buy, engage, etc. is like prospecting for gold. There is a lot of rock and sand that has to be filtered out to get the “pure” gold dust.
Rocker Boxes
A rocker box (also known as a cradle) is a gold mining implement for
separating alluvial placer gold from sand and gravel which was used in
placer mining in the 19th century. It consists of a high-sided box, which
is open on one end and on top, and was placed on rockers.
The inside bottom of the box is lined with riffles (usually glass or plastic) and usually a carpet, similar to a sluice. On top of the box is a classifier sieve or grizzly (usually with half-inch or quarter-inch openings), which screens out larger pieces of rock and other material, allowing only finer
sand and gravel through. Between the sieve and the lower sluice section
is a baffle, which acts as another trap for fine gold and also ensures that
the aggregate material being processed is evenly distributed before it
enters the sluice section. It sits at an angle and points towards the closed
back of the box. Traditionally, the baffle consisted of a flexible apron or
“blanket riffle” made of canvas or a similar material, which had a sag of
about an inch and a half in the center, to act as a collection pocket for
fine gold. Later rockers (including most modern ones) dispensed with the
flexible apron and used a pair of solid wood or metal baffle boards. These
are sometimes covered with carpet to trap fine gold. The entire device
sits on rockers at a slight gradient, which allows it to be rocked side to
side.
Propensity Models
A propensity model is like a rocker box: what comes out with a high
propensity score is your gold dust customers—the ones that are rich
with propensity to do whatever it is that you are modeling. The debris
scores low in the model. If you have 10,000,000 customers, chances are that only 10,000 (or fewer) have a high propensity to buy product X. But those customers are your gold dust. (A friend, Steve Cartwright, just posted an article on microtargeting.) I also use the analogy of a filtered funnel, where the top has a large opening and the bottom a much smaller one, and where my "gold dust" customers are small enough to pass through the tiny funnel opening at the bottom. Many go in, but few come out.
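The rocker box can be sketched in a few lines of Python. The scores below are hypothetical, drawn from a Beta(1, 50) distribution so that most customers score near zero, the way real buy propensities usually look:

```python
import random

random.seed(1)

# Hypothetical propensity scores for 100,000 customers: Beta(1, 50)
# keeps most scores near zero. The model-as-rocker-box keeps only the
# high scorers -- the "gold dust".
scores = {cust_id: random.betavariate(1, 50) for cust_id in range(100_000)}

gold_dust = [c for c, s in scores.items() if s >= 0.10]   # the few who pass
# Many go in (100,000), but only a few hundred come out.
```

The cutoff of 0.10 is arbitrary here; in practice you would pick it from the score distribution and the economics of the campaign.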
It is incumbent on us modelers to manage the expectations of our customers. Though quite a few already understand this model phenomenon, we must establish with the customer what to anticipate as an outcome at the beginning of a model project, even before we think about data, model functional form, and so on.
So, the voice that said, "If you build it they will come", was not talking to
me (even though I strongly resemble Kevin Costner). Instead, my voice
said, "If you build it fewer, but richer will come."
Uplift Models in Plain English, sort
of...
I have used the term "uplift" when speaking of a class of predictive models, but now I want to provide a plain-English description. Before I do that, however, do not get your hopes up. Uplift models are hard to construct and even harder to maintain--they are not a "cure-all" solution to marketing problems.
netlift is equal to responses from the treated group minus responses
from the control group during the campaign acquisition window,
roughly.
Now for a point about data collection. You do not necessarily have to
track who is responding to the campaign. You only need to see who
purchases the product you are marketing during the acquisition window.
Then for both the treated and control group, if a purchase is made during
the campaign acquisition window it counts as a response. If your
customer database is set up so that you track products on a regular
basis--daily, weekly, monthly, and so on--then if the acquisition window
is 60 days, you merely look at the product count before the 60-day
window and the product count after the 60-day window, and measure
the difference. Without going into the technical detail, the modeler or
marketing analyst usually does this, and only one response is counted
regardless of the number of products purchased. The non-technical reason for this is that we are measuring "response to a campaign," not "number of products purchased."
If you are eager for the technical details, or if you are suffering from
insomnia and wish to sleep, download my free e-Book entitled
"Predictive Modeling and Analytics". Uplift modeling is found in Chapter
17. Of course you are always welcome to purchase a copy at Lulu.com
or Amazon.com, thus contributing to my gas money.
What’s the Difference?
I have been writing a little bit about analytics and models. Here are the differences between three types of models I have discussed that may need clarification.
Descriptive models
These models describe what has already occurred, and in analytics they are usually statistical. They help the business know how things are going. Clustering models might fall into this category.
Predictive models
Predictive models tell us what will probably happen in the future as a
result of something that has already happened. They help the business
forecast future behavior and results. These may be comprised of
statistical models, like regression, or machine learning models, like
neural networks. "Propensity to buy" models fall into this category.
Prescriptive models
These models are like descriptive-predictive models. They go an extra
step and tell us why and help the business prescribe a next course of
action. That is, they not only tell us what probably will happen, but also
why it may happen. From here we can look at what's next, for if we know the why, we can better know how to mold things for the future.
Analytics
According to The Institute for Operations Research and the Management Sciences (INFORMS), analytics is defined as the scientific process of transforming data into insight for making better decisions.
To perform good analytics, companies should employ all three types of
modeling in order to cover the entire spectrum of their business.
Terms
I use words like “probably” and “may” because there is always a bit of uncertainty in predictive and prescriptive models. You usually hear about this in phrases like “95 percent confidence”, or perhaps in terms like “Type II error.”
What is a Propensity Model?
Predictive analytics was a topic of one of my recent posts, and I have
been often asked how specifically marketers can use predictions to
develop more profitable relations with their customers. Yesterday, I
talked specifically about predictive modeling. There are three types of
predictive models marketers should know about, but I will only talk
about the first one in this article:
Propensity models are what most people think of when they hear “predictive analytics”. Propensity models make predictions about a customer’s future behavior. However, keep in mind that even propensity models are abstractions and do not necessarily predict absolute true behavior (see “What is Predictive Modeling?”). I’ll go through six examples of propensity models to explain the concept.
Model 1: Predicted customer lifetime value
CLV (Customer Lifetime Value) is a prediction of all the value a business
will derive from their entire relationship with a customer. The Pareto
Principle states that, for many events, roughly 80% of the effects come
from 20% of the causes. When applied to e-commerce, this means that
80% of your revenue can be attributed to 20% of your customers. While
the exact percentages may not be 80/20, it is still the case that some
customers are worth a whole lot more than others, and identifying your
“All-Star” customers can be extremely valuable to your business.
Algorithms can predict how much a customer will spend with you long before customers themselves realize it.
At the moment a customer makes their first purchase you may know a
lot more than just their initial transaction record: you may have email
and web engagement data for example, as well as demographic and
geographic information. By comparing a customer to many others who
came before them, you can predict with a high degree of accuracy their
future lifetime value. This information is extremely valuable as it allows
you to make value based marketing decisions. For example, it makes
sense to invest more in those acquisition channels and campaigns that
produce customers with the highest predicted lifetime value.
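The 80/20 arithmetic is easy to check on hypothetical per-customer revenue figures:

```python
# Sort customers by revenue and ask what share the top 20% account for.
# All revenue figures here are hypothetical.
revenues = [5, 8, 12, 15, 20, 25, 40, 60, 300, 515]
revenues.sort(reverse=True)
top_n = max(1, len(revenues) // 5)              # top 20% of customers
top_share = sum(revenues[:top_n]) / sum(revenues)
# Here 2 of 10 customers account for 81.5% of the 1,000 in revenue.
```

Running the same calculation on your own customer base tells you how far from (or beyond) 80/20 you really are.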
example, a propensity to engage model can predict how likely it is that
a customer will click on your email links. Armed with this information
you can decide not to send an email to a certain “low likelihood to click”
segment.
For example, a “propensity to buy a new vehicle” model built with only the data the automotive manufacturer has in its database can be used to predict a percentage of sales. By incorporating demographic and lifestyle data
from third parties, the accuracy of that model can be improved. That is,
if the first model predicts 50% sales in the top five deciles (there are ten
deciles), then the latter could improve the result to 70% in the top five
deciles.
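The decile arithmetic works like this: sort customers by model score, split them into ten equal deciles, and measure what share of actual buyers lands in the top five. A sketch, with hypothetical scores and buy flags:

```python
# scored: list of (model_score, bought_flag) pairs
def top_half_capture(scored):
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    half = len(ranked) // 2                       # top five of ten deciles
    total_buyers = sum(bought for _, bought in ranked)
    return sum(bought for _, bought in ranked[:half]) / total_buyers

# 20 customers; 10 bought, 7 of whom sit in the top half by score
scored = [(1 - i / 20, 1 if i in {0, 1, 2, 4, 6, 7, 9, 11, 13, 16} else 0)
          for i in range(20)]
capture = top_half_capture(scored)  # 0.7, i.e. 70% in the top five deciles
```

A model that captured only 50% in the top half would, in this framing, be the weaker first model; enriching the data is what pushes that figure up.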
Model 6: Propensity to churn
Companies often rely on customer service agents to "save" customers
who call to say they are taking their business elsewhere. But by this time,
it is often too late to save the relationship. The propensity-to-churn model tells you which active customers are at risk, so you know which high-value, at-risk customers to put on your watch list and reach out to.
Armed with this information, you may be able to save those customers
with preemptive marketing programs designed to retain them.
Conclusion
Predictive analytics models are great, but they are ultimately useless
unless you can actually tie them to your day-to-day marketing
campaigns. This leads me to the first rule of predictive analytics:
It is better to start with just one model that you use in day-to-day
marketing campaigns than to have 10 models without the data being
actionable in the hands of marketers.
What Are Clustering models?
Clustering is the predictive analytics term for customer segmentation.
Clustering, like classification, is used to segment the data. Unlike
classification, clustering models segment data into groups that were not
previously defined. Cluster analysis itself is not one specific algorithm,
but the general task to be solved. It can be achieved by various
algorithms that differ significantly in their notion of what constitutes a
cluster and how to efficiently find them.
With clustering you let the algorithms, rather than the marketers, create
customer segments. Think of clustering as auto-segmentation.
Algorithms are able to segment customers based on many more
variables than a human being ever could. It’s not unusual for two clusters
to be different on 30 customer dimensions or more. In this article I will
talk about three different types of predictive clustering models.
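To show what "auto-segmentation" means in code, here is a bare-bones sketch of k-means, the most familiar clustering algorithm; the (spend, visits) pairs and starting centers are hypothetical:

```python
# The algorithm, not the marketer, decides which customers group
# together: repeatedly assign each point to its nearest center, then
# move each center to the mean of its group.
def kmeans(points, centers, iters=10):
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centers[i])))
            groups[nearest].append(p)
        centers = [tuple(sum(coord) / len(g) for coord in zip(*g))
                   for g in groups]
    return centers, groups

# hypothetical customers as (monthly spend, visits per month)
points = [(10, 1), (12, 2), (11, 1), (90, 8), (95, 9), (88, 7)]
centers, groups = kmeans(points, centers=[(0, 0), (100, 10)])
# The low-spend and high-spend customers separate into two segments.
```

A production tool would run this over dozens of customer dimensions rather than two, which is exactly why the algorithm can find segments no marketer would draw by hand.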
Product based clustering algorithms discover what different groupings
of products people buy from. See the example below of a category (or
product) based segment or cluster. You can see people in one customer
segment ONLY buy Pinot Noir, whereas those in another customer
segment buy different types of Varietal products, such as Champagne,
Chardonnay, Pinot Grigio and Prosecco – but never Cabernet Sauvignon,
Malbec or Espumante. This is useful information when deciding which
product offers or email content to send to each of these customer
segments.
What is a Simulation Model?
The question is actually misleading. A simulation can be comprised of one model or many interacting models. For instance, we might take a statistical model and simulate it using random variates. In this case we
are simulating one model. But suppose we want to simulate the
effectiveness of a ballistic missile defense system. This system is
comprised of models for sensors, interceptors, guidance systems,
command and control, and so on. So in this instance we are simulating a
system of models.
There are several types of simulation models, some of which you may have heard of.
Monte Carlo Simulation Model
Monte Carlo simulation is used to estimate stochastic processes where
either the underlying probability distributions are unknown or difficult
to calculate by exact computations. Monte Carlo simulations sample
probability distributions for each variable to produce hundreds or
thousands of possible outcomes. The results are analyzed to get
probabilities of different outcomes occurring. For example, a
comparison of a spreadsheet cost construction model run using
traditional “what if” scenarios, and then run again with Monte Carlo
simulation and Triangular probability distributions may show that the
Monte Carlo analysis has a narrower range than the “what if” analysis.
This is because the “what if” analysis gives equal weight to all scenarios, while the Monte Carlo method rarely samples in the very low probability regions. The samples in such regions are called "rare events". Monte Carlo simulations may be used in a wide range of applications.
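The spreadsheet comparison can be sketched concretely: a cost built from three line items, each modeled with a triangular (low, mode, high) distribution. All figures here are hypothetical:

```python
import random

random.seed(42)

# Three cost line items, each as a (low, mode, high) triangular estimate
items = [(8, 10, 15), (4, 5, 9), (1, 2, 4)]

def one_trial():
    # draw each line item from its triangular distribution and total them
    return sum(random.triangular(low, high, mode) for low, mode, high in items)

trials = sorted(one_trial() for _ in range(10_000))
worst_what_if = sum(high for _, _, high in items)   # 28: every item at its max
p95 = trials[int(0.95 * len(trials))]
# The simulated 95th percentile sits well below the "everything goes
# wrong at once" what-if total -- the narrower range described above.
```

The "what if" worst case assumes all three items hit their maximum simultaneously, a rare event the Monte Carlo run almost never samples.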
Discrete Time Simulation Models
In discrete time simulation, the models are iterated over fixed time increments. For example, we may want to experiment with the operation of a car wash over time and increment the model every ten minutes to observe the frequency with which it is used through the working day. We might make decisions about its operation based on the frequency of use in peak hours. Or we might want to see how an interceptor missile performs over time as it attempts to destroy a theater ballistic missile, making observations every second. We can use these simulations to model:
Manufacturing plants
Communications networks
Transportation systems
Combat operations
Other systems that involve service, queuing, and processing
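The car-wash example above can be sketched in a few lines; the 40% chance of a car arriving in any free ten-minute step is hypothetical:

```python
import random

random.seed(7)

# Step the clock in fixed ten-minute increments through a 12-hour
# working day and count how often the single wash bay is busy.
busy_steps = 0
bay_free_at = 0                       # step index when the bay next frees up
for step in range(72):                # 72 ten-minute steps = 12 hours
    if step >= bay_free_at and random.random() < 0.4:
        bay_free_at = step + 2        # one wash occupies the bay for 20 minutes
    if step < bay_free_at:
        busy_steps += 1

utilization = busy_steps / 72         # fraction of the day the bay was in use
```

Splitting the count by time of day would give exactly the peak-hour view described above.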
Simulation Architectures
Simulation architectures exist for running mixed-time simulations, such as discrete time and discrete event simulations. One of the earliest
architectures is known as Distributed Interactive Simulation or DIS. The
High Level Architecture (HLA) is another example.
Conclusion
We use simulations to experiment with systems and perform analyses
without having to use the actual systems. For example, it would be very
disruptive to a call center if we performed experiments with it during its
normal operating hours, which in many cases is 24-7. However, if we can
construct a reasonable model of the call center, then we can use it to
simulate call center operations and perform our experiments using the
simulation.
What are Stochastic Models?
Is there a chance?
In my previous post on simulation, I used the term stochastic. What does
that mean? This article is about the meaning of “stochastic” and its
counterpart, “deterministic”.
Stochastic
Stochastic is synonymous with “random.” The word is of Greek origin
and means “pertaining to chance” (Parzen 1962, p. 7). It is used to indicate that a particular subject is seen from the point of view of randomness. Any system or process that must be analyzed using
probability theory is stochastic at least in part. Stochastic systems and
processes play a fundamental role in mathematical models of
phenomena in many fields of science, engineering, and economics.
Familiar examples of processes modeled as stochastic (or stochastic time
series) include stock market and exchange rate fluctuations, signals such
as speech, audio and video, medical data such as a patient's EKG, EEG,
blood pressure or temperature, and random movement such as
Brownian motion or random walks. Stochastic models are sometimes
referred to as “probabilistic models”.
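The simplest stochastic process named above, a random walk, takes only a few lines to simulate:

```python
import random

random.seed(0)

# Each step moves up or down by 1 at random, so two runs from the same
# starting point trace different paths -- which is what "stochastic" means.
def walk(steps):
    position, path = 0, [0]
    for _ in range(steps):
        position += random.choice([-1, 1])
        path.append(position)
    return path

path_a = walk(100)
path_b = walk(100)   # same code, different chance outcome
```

Fixing the seed makes the experiment repeatable, but the process itself remains stochastic: remove the seed and every run differs.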
Plot of S&P Composite Real Price Index, Earnings, Dividends, and
Interest Rates using data from
http://irrationalexuberance.com/shiller_downloads/ie_data.xls
Deterministic
A deterministic system is one whose resulting behavior is entirely
determined by its initial state and inputs, and which is not random or
stochastic. Processes or projects having only one outcome are said to be deterministic; their outcome is 'pre-determined.' A deterministic algorithm, for example, given the same input information, will always
produce the same output information. An example of a deterministic
system is the physical laws that are described by differential equations,
even though the state of the system at a given point in time may be
difficult to describe explicitly. An example of a deterministic model is a
logistic regression model of customers’ propensity to buy life insurance
products. A stochastic system is one that is non-deterministic.
References
Parzen, E. Stochastic Processes. Oakland CA: Holden Day, p. 7, 1962.
What Are Neural Networks?
Are you neural?
What I really meant to say is, “Are your models neural?” Should they be? Could they be? I have built some crazy models, but nothing like an artificial neural network (ANN). Maybe we should ask, “Is your modeler neural?” Be that as it may, here is a layman's explanation...hopefully (just remember I am a modeler, so cut me some slack).
The key element of this paradigm was the novel structure of the
information processing system. It was composed of a large number of
highly interconnected processing elements (neurons) working in unison
to solve specific problems. ANNs, like people, learn by example. An ANN
is configured for a specific application, such as pattern recognition or
data classification, through a learning process. Learning in biological
systems involves adjustments to the synaptic connections that exist
between the neurons. This is true of ANNs as well.
There was a re-emergence of interest and research in the late 1970s and
early 1980s. However, in the 1990s, neural networks were overtaken in
popularity in machine learning by support vector machines and other,
much simpler methods such as linear classifiers. Yet the neurons would not die, and between 2009 and 2012, work at the Swiss AI Lab IDSIA produced award-winning neural networks in pattern recognition and machine learning competitions.
optimal function (or optimal solution) in this set of functions, or a
function that solves the task in some optimal sense. The simple ANN can consist of three layers, as depicted below. Within the hidden layer, learning occurs. In very simple pseudo-code, where f is an input node, h is a hidden node, and f* is an output node, this might look like:
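(A minimal rendering of that pseudo-code in Python, assuming a single hidden layer with sigmoid activations; the weights below are hypothetical stand-ins for what learning would adjust.)

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def forward(f, w_in, w_out):
    # each hidden node h is a squashed, weighted sum of the input nodes f
    h = [sigmoid(sum(w * x for w, x in zip(weights, f))) for weights in w_in]
    # the output node f* is a squashed, weighted sum of the hidden nodes
    f_star = sigmoid(sum(w * x for w, x in zip(w_out, h)))
    return f_star

out = forward(f=[1.0, 0.5], w_in=[[0.4, -0.2], [0.3, 0.8]], w_out=[1.0, -1.0])
```

Training would repeatedly nudge the weights in w_in and w_out to reduce the error in f*, which is the "adjustment of synaptic connections" described above.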
complicated or imprecise data, can be used to extract patterns and
detect trends that are too complex to be noticed by either humans or
other computer techniques. A trained ANN can be thought of as an
"expert" in the category of information it has been given to analyze. This
expert can then be used to provide projections given new situations of
interest and answer "what if" questions. ANNs are universal
approximators, and they work best if the system you are using them to
model has a high tolerance to error. You would not want to use a neural
network to balance your checkbook! However, they work very well for:
Before using an ANN model, you should have (or if you are not the modeler, the modeler should have) an understanding of the mathematical complexity of ANNs. To some degree, they are “black box” models; that is, models in which you cannot see the internal workings. In other words, using our pseudo-code, we cannot see how h is learning.
References
Kurzweil AI interview with Jürgen Schmidhuber on the eight competitions won by his Deep Learning team, 2009–2012 (2012): http://www.kurzweilai.net/how-bio-inspired-deep-learning-keeps-winning-competitions
McCulloch, Warren; Walter Pitts (1943). "A Logical Calculus of the Ideas Immanent in Nervous Activity". Bulletin of Mathematical Biophysics 5 (4): 115–133. doi:10.1007/BF02478259
What is Discrete Event Simulation?
I do not go to the bank very much anymore. My handy iPhone app allows
me to deposit checks online and the ATM gives me fast cash. But, have
you ever waited in line for service at a bank on your lunch hour? Enter
Discrete Event Simulation...
studying reliability, availability and maintainability (RAM)
just about anything that involves queuing (arrivals, waiting time,
service time)
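The bank line itself makes a tidy discrete event simulation: the clock jumps from arrival to arrival instead of ticking in fixed steps. A minimal sketch, with hypothetical interarrival and service rates:

```python
import random

random.seed(3)

# A single teller; customers arrive with exponential interarrival times
# and take exponentially distributed service times.
def simulate(n_customers, mean_interarrival=2.0, mean_service=1.5):
    clock, teller_free_at, waits = 0.0, 0.0, []
    for _ in range(n_customers):
        clock += random.expovariate(1 / mean_interarrival)  # next arrival
        start = max(clock, teller_free_at)                  # wait if teller is busy
        waits.append(start - clock)
        teller_free_at = start + random.expovariate(1 / mean_service)
    return sum(waits) / len(waits)

avg_wait = simulate(1000)   # average minutes spent waiting in line
```

Adding a second teller to the model, not to the branch, is how you answer "would another teller shorten the lunch-hour line?" without disrupting the real bank.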
What is Predictive Analytics
Missing?
Simulation! Oh sure, there are pockets of analysts using simulation in
predictive analytics and pockets of companies that use the results, but
the use of simulation to predict behavior (e.g., propensity to do
something) is not widespread.
I am taking a break today from the “how to” and “X reasons why” articles
and offer instead a little editorial contemplation.
Understand why observed events occur
Identify problem areas before implementation
Explore the effects of modifications
Confirm that all variables are known
Evaluate ideas and identify inefficiencies
Gain insight and stimulate creative thinking
Communicate the integrity and feasibility of your plans
We are doing just fine
The response to this might be that they already understand these
phenomena well and they already know what variables are important.
They probably did at one time. However, I know for sure my spending
behavior, engaging behavior, borrowing behavior and so on, has been
altered drastically by a struggling economy. “Oh, but our economy is
healthy,” you might say. So then, why do products and services cost more while I get paid less? Ten years ago, a Senior Operations Research Analyst was worth $X thousand, but today he or she is worth $(X – Y) thousand, yet the energy bill keeps rising without bound. Should we not
even test the idea that the world has changed and what we used to know
is now just a mystery? However, we are doing just fine, aren’t we?
“Almost nobody’s competent, Paul. It’s enough to make you cry to see how bad most people are at their jobs. If you can do a half-assed job of anything, you’re a one-eyed man in the kingdom of the blind.”
― Kurt Vonnegut, Player Piano
The Operations Research analyst should have one good eye and hence
lead the blind to safety. I have argued before that OR analysts must take a holistic approach to problem solving, and that includes simulation. “Oh, but I have not delved into simulation before”, one might say. My response is, “What are you waiting for?” LPs, IPs, and
MILPs (if you are an OR analyst you know what these are) are not going
to solve every problem we face, but it seems to be the focus in many OR
educational programs. Get out of your comfort zone and get busy being
a holistic problem solver.
Break the rules!
Every OR analyst should have a little bit of rebel inside them. Our approaches must be novel in many cases, and the rules that govern regression models must sometimes be broken if we are to be worth our weight in salt (since we are not on a gold standard). When our models are not
producing much lift (or net lift), something may be broken. It could be
the data or the preconceived notion of customer propensity. Simulation
can help find the root causes, for you can simulate a system or
phenomenon without ever touching the real, operational system.
A little at a time
The financial industry might not be ready to pour the weight of their
analytics resources into simulation, and that is understandable. So,
perhaps start with a pilot study, get some results and show them how
simulation can help them. Imagine where we might be if the financial
industry had been doing this kind of “What if” analysis in the early part
of this century. Would we have seen the events of 2008-ish coming?
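A pilot "What if" study need not be elaborate. The sketch below (all names and numbers are hypothetical) runs a tiny Monte Carlo simulation of loan defaults in a portfolio, then shocks the economy-wide default rate to see how the loss distribution shifts, all without touching any operational system.

```python
import random

def simulate_defaults(n_loans=1000, default_rate=0.03, trials=300, seed=42):
    """Monte Carlo draw of total defaults in a loan portfolio.

    Returns the median and 99th-percentile default counts across trials.
    """
    rng = random.Random(seed)
    totals = sorted(
        sum(1 for _ in range(n_loans) if rng.random() < default_rate)
        for _ in range(trials)
    )
    return totals[trials // 2], totals[int(trials * 0.99)]

# "What if" comparison: baseline economy vs. a stressed economy
baseline = simulate_defaults(default_rate=0.03)
stressed = simulate_defaults(default_rate=0.08)
```

Comparing the two runs shows how much worse the tail of the loss distribution gets under the shock, which is the kind of question a pilot study could answer.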
Will it crash?
We do not fly launch vehicles in space without simulation, nor launch
test missiles without simulation. It is too expensive—in people and
material—to launch and fail. In our early years of space exploration, we
experienced many crashes before the first human was launched into
space, and we have had some failures since then. But a lot of time, effort
and money go into simulation for these programs. We used simulation
before executing Operation Iraqi Freedom, and one senior officer was
quoted saying that we had fought that battle over and over in
simulation, prior to deploying, which made the fight look easy (even
though there is no “easy” in warfare). Imagine how many lives were
saved by simulation. Will it crash? Of course it will, but to simulate the
crash is much better than experiencing the crash in real time.
Conclusion
I have thus far failed to impress my financial institution customer with
the importance of simulation. Technically, I am a failure as an OR analyst.
Sure, I have given my customer the product they asked for, but
sometimes the customer does not know what they really need. It is
incumbent on us to help them to ask the right questions, develop the
best business cases, and use the correct solution method—and that may
be simulation.
Part IV – Statistical Matters
There are three types of lies — lies, damn lies, and statistics.
― Benjamin Disraeli
political rival William Gladstone. He enjoyed the favor of Queen Victoria,
who shared his dislike of Gladstone. Benjamin was not a statistician!
I couldn't claim that I was smarter than sixty-five other guys--but the
average of sixty-five other guys, certainly!
Outside of the killings, DC has one of the lowest crime rates in the
country.
― Marion Barry
Can we lie with statistical models?
We modelers say that our models are only as good as the data they are
built upon. We try to select data that is predictive, without the biases of
our positions or theories. Sometimes the phenomena we are trying to
predict, say the propensity to shed a product, is just not there. It could
be that our intuition is wrong. It could be that the data is wrong. It could
be that we do not have all the pertinent data. Hence, we are often left
with the conclusion that we need more data, better data, or we cannot
build a predictive model. Again, there is nothing wrong with statistical
methods, say linear regression. The problem lies with the data: not
enough, too much, missing data, and so on.
One of the first things taught in introductory statistics textbooks is that
correlation is not causation. It is also one of the first things forgotten.
We may at once admit that any inference from the particular to the
general must be attended with some degree of uncertainty, but this is
not the same as to admit that such inference cannot be absolutely
rigorous, for the nature and degree of the uncertainty may itself be
capable of rigorous expression.
― Ronald Fisher
Why is Analysis like Hiking?
I was in the Rocky Mountains last week, near Buena Vista. I went on
several hikes along the Colorado Trail and over some of the roughest
terrain I have ever hiked, due to the rainfall and resulting washouts.
Afterward, I went to my customer’s site for a few days. I missed the
hiking, but as I reflected on it, the reason was not just that I was at
work again. It was that hiking is like doing analytics.
Planning
Hiking requires extensive planning. My longest hike was 16 kilometers.
Part of the hike (toward the end) was on the Colorado Trail. The
beginning was up a set of switchbacks that went from about 2,740 meters
to 3,050 meters. At the top, I had to go cross-country until picking up the
main trail. During this first portion I would probably not see another
human being. I would also be exposed to the intense sun. Since I was
hiking alone, I needed to let someone know my route and the time I
expected to return. If I were incapacitated on the isolated portion of my
hike, it might be days before I was found. I also had to plan my load,
which needed to provide protection from the sun and mosquitos,
protection from potential rainfall, water, and so on.
Planning an analytics project is much the same. You have to consider the
path that your analytic solution will take you down, where you might go
astray, different contingencies, bottlenecks, time, tools required and
how you would finish.
Execution
Executing the hike started with walking up a dirt road for half a mile to
the base of the switchbacks, ascending to the top of the ridge, going
north to the Colorado Trail, then west along the top of the ridge to the
descent, and back to a dirt road for a 1.5 mile walk to the basecamp.
Although I was familiar with the route, rainfall in May had altered the
appearance of the terrain and the path to the Colorado Trail was difficult
to follow, so I had to dead-reckon until I came upon a cairn (a small pile
of rock used to mark a trail). Once I found the trail, there was a gradual
ascent to about 3,050 meters. Here I stopped for lunch and rested my legs,
which felt like jelly. The descent down the Colorado Trail would require
strong legs and concentration. Often going down is much more difficult
than going up, especially if you are carrying a load. Fortunately, I did not
encounter any rain or storms and made it back to camp an hour early.
you start climbing over rough terrain, arriving at a solution at the
apex of your approach. Once you get your solution, you start your descent
of post-processing and evaluating your solution. Finally, you have to
interpret and explain your solution on your way toward project
completion.
Relating the analysis story is similar. It almost requires a storyteller who
either has never hiked or has hiked and also traveled by horseback. The
storyteller has to be in the culture of their listeners and in the proper
context. Obviously, we would not use terms like mean and type I error,
but use average and false positive. We may believe that everyone knows
what a “mean” value is, but many in fact do not. We have to allow the
stakeholder to “experience” our analytic journey within their culture and
context.
Conclusion
The two words 'information' and 'communication' are often used
interchangeably, but they signify quite different things. Information is
giving out; communication is getting through.
—Sydney J. Harris
If we are not ensuring that our information is getting through to the
client, we not only fail to relate our analysis journey, we also fail to
deliver our analytic solution, for delivery requires both deployment and
implementation. Thus, as we hike, or climb, or ride, we also analyze.
12 Ways Not to Please Your
Customer
I guess that "250.5 ways NOT to please your customer" seemed a bit
daunting. Thus, I am going to cut it to the top twelve—the first six are
additions to the original 6 (of 250.5). If you do not have a direct
customer, then interpret customer as boss or supervisor. Also, if your
spouse is your boss—as mine is, but I am not implying she is bossy—then
interpret customer as spouse or significant other, which could be your
Great Dane. I had intended to add a few every week, since there are 1א
(pronounced aleph-1 and you’ll have to read my post on Jessica Rabbit
and the Number System) ways to NOT please your customer. Now I
think I will just write twelve and most others are variations. I have
committed most of the taboo actions at one time or another, so this is
based on lessons learned by a slow learner.
1. Do not get to work after your customer, unless your customer goes
to work at 2:30 AM, or has scolded you several times for being too
early. I always want to be available to my customer, especially if
they come in early and want to get things accomplished before people
typically show up for work. Sometimes when they know you are
going to be there for them, they turn to you as someone they can
depend upon.
2. Do not go home before your customer, unless they are a hardcore
workaholic. At least check with your customer before you leave for
the day, to make sure they have no pressing deadlines or that
something just came up that they need help with. While others are
having a beer at Bubba’s Bar, you will be gaining the trust of your
customer. However, do this within reason if you have a family. Your
family is more important than going the extra mile for your
customer.
3. Do not blindly follow the instruction of your customer. Wait, will
that not displease them? Not if you do it selectively and with tact.
You have the responsibility to help your customer do the right thing
and make them successful. It is not your job to shine. It is your job
to make sure they shine. If you want glory, change careers—
become an actor, a singer, a professional quarterback, etc.
Continued business and a paycheck is your reward. Ask questions
that will cause the customer to consider, or reconsider courses of
action, and do so with great humility. “Ma’am, I know you are
probably right about this, but have you considered X?” In the end
when the customer says execute, do so as if it were your own idea,
unless ethics are involved.
4. Do not let your customer fail due to lack of planning. Make sure
they have (or you have for them) considered redundancy and
contingencies. Remember Charlie Beckwith, always take an extra
helicopter (look up Operation Eagle Claw). Sub-rules: (1) Always
use pencil—things change often, and (2) Do things faster, harder,
and smarter.
5. Never work on company business or personal stuff on the
customer’s time. Even if you are not charging, if this occurs during
normal work hours, this could appear as unethical—perception is
everything, and this is a quick way to lose trust, and perhaps your
contract. Sometimes your contract will specify things that you can
do on the customer’s time, like company ethics training. If in doubt
ask—questions do not cost as much as breaches of integrity.
6. Do not withhold information from your customer (unless it is
proprietary and you have signed an agreement saying that you will
not). This is not the same as lying, but it is related, and will destroy
trust just as quickly as lying. Don’t use information as power. Give
them information to leverage as power.
7. Do not tell them what they can’t do. Our job is to help them do what
they want to do, to achieve what they want to achieve. We have to
help them fulfill their vision. That does not mean that we cannot
advise them about risks and issues—we need to be diligent in
doing this—but we have to also help them mitigate the risks and
find a solution to get there. If it is NASA—well, at least before the
next lunar exploration program was cancelled—then we have to
help them reach the moon.
8. Do not tell them that something they want is out of scope. At least
do not do this directly and without discussing scope with your
contract representative. Going to the moon with a new launch
platform (like the Ares I) is probably not one of those “out of
scope” tasks. However, if they want you to build the Starship
Enterprise, at least without Mr. Scott on your team, and maybe a
Vulcan, then it might be out-of-scope. Often scope can be changed
to help the customer get to where they want to go.
9. Do not tell them they are wrong. Your customer is never wrong—
and if your boss is your spouse and your spouse is your wife, she is
never wrong, unless you have a really nice doghouse in the back
yard. Your customer’s idea, desire, command, etc., may be
misguided (out of alignment with their vision), and our job is to
also help them align tasks, project and programs with their vision
and to provide a roadmap to get there. Now, they may be wrong,
but if you help guide them, they will come to this realization
themselves, usually.
10. Do not—never ever ever—lie to your customer. In fact, never lie to
anyone. It is perfectly okay to say that you cannot reveal
information due to its security classification or proprietary nature,
but never lie. If you did not cross a “t” or dot an “i”, then own up to
it and do it. This will, at a minimum, fulfill your contract obligation
to complete a task, but it may also say to your customer, “this
person can be trusted, because they always tell me the truth.”
Now, you may be incompetent and truthful at the same time, and
the incompetence may get you fired, but at least you maintain
your integrity. The author of Player Piano [Kurt Vonnegut] said
this: "Ninety percent of people are incompetent at what they do.
You would be surprised at how really bad they are. If you are
marginally competent, you may go far." Thus, if you are a marginally
competent, non-liar, you at least have a chance of pleasing your
customer.
11. Do not berate your customer behind their back—or to their side, or
over them, or in a parallel universe. They WILL “hear it through the
grapevine.” I have been a supervisor/manager and as such suffered
the natural phenomenon of beratization—add that to your
dictionary—but sooner or later, I always hear about it, even if I
have to sneak around and eavesdrop. Do not think your customer
will not do the same. Nothing good can come from it!
12. Do not mention your customer on social media, even if it is in
praise, unless it is part of a task on your contract to do so. Without
belaboring the point, nothing good can come from it!
Analytics and Statistics: Is there a
difference?
Is there a difference between statistics and analytics? In previous posts,
I have claimed that there is a difference. Here, I will attempt to explain
my reasoning for reaching this conclusion.
What is Statistics?
Statistics is the study of the collection, analysis, interpretation,
presentation, and organization of data [1]. In applying statistics to, e.g.,
a scientific, industrial, or societal problem, it is necessary to begin with
a population or process to be studied. Populations can be diverse topics
such as “all persons living in a country” or “every atom composing a
crystal”. It deals with all aspects of data including the planning of data
collection in terms of the design of surveys and experiments [1].
Descriptive statistics are most often concerned with two sets of properties
of a distribution (sample or population): central tendency (or location) seeks to
characterize the distribution's central or typical value, while dispersion
(or variability) characterizes the extent to which members of the
distribution depart from its center and each other. Inferences on
mathematical statistics are made under the framework of probability
theory, which deals with the analysis of random phenomena. To make
an inference upon unknown quantities, one or more estimators are
evaluated using the sample.
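Those two sets of properties are exactly what Python's standard statistics module computes; here is a minimal illustration (the sample values are made up):

```python
import statistics

sample = [12.1, 9.8, 11.4, 10.9, 13.2, 10.5, 11.8]

center = statistics.mean(sample)    # central tendency: the typical value
spread = statistics.stdev(sample)   # dispersion: departure from that center
```

The sample mean is itself an estimator of the unknown population mean, the simplest case of the inference described above.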
What is Analytics?
Actually, analytics and statistics share a common thread: they both use
statistical procedures and analyses. However, unlike statistics, the
analytics scientist often deals with analyses where there is no assumed
null hypothesis, and subsequently employs machine learning algorithms
in the analyses. This is an important distinction in understanding my
point of view. For instance, I very seldom approach predictive modeling
with a null hypothesis. On the other hand, I do perform statistical
analyses with a null hypothesis at the forefront—there is a subtle but
distinct difference.
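To make the contrast concrete, here is the null-hypothesis style of analysis in miniature: a one-sample t statistic for the hypothesis that the population mean equals a stated value (the function and data are illustrative, not from the text).

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """t statistic for the null hypothesis that the population mean is mu0.

    Large absolute values are evidence against the null hypothesis.
    """
    n = len(sample)
    return (statistics.mean(sample) - mu0) / (statistics.stdev(sample) / math.sqrt(n))
```

A predictive-modeling workflow, by contrast, would score and rank customers without ever framing such a hypothesis.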
I will not go into great detail here, since my last post described Predictive
Analytics, but I will summarize a description of analytics. Analytics is the
discovery and communication of meaningful patterns in data. Especially
valuable in areas rich with recorded information, analytics relies on the
simultaneous application of statistics, machine learning, computer
programming and operations research to quantify performance or
predictions. Analytics often favors data visualization to communicate
insight [3].
Conclusion
If you accept the descriptions of statistics and analytics that I have
presented here, then there is an obvious difference between the two
disciplines, though they do share some things. This is not meant to
belittle the work of the statistician. The statistician is very professional,
understands their discipline extremely well and provides valuable,
focused analyses. While statistics is a “focused” or “specialized”
discipline, analytics is much more general—the analytics professional is
to some degree a “jack of all trades”. Often, the analytics professional
will turn to the statistics professional for help with analyses where
statistical hypotheses are paramount.
References
Dodge, Y. (2006) The Oxford Dictionary of Statistical Terms, OUP. ISBN 0-
19-920613-9
Are Statisticians a Dying Breed?
“Facts are stubborn things, but statistics are pliable.”
― Mark Twain
“Every time I sit with our general manager at a baseball game, and
there's number-cruncher and statistician guy - I'm sitting around - they
start talking about stuff, and I say, 'What's that? I've never heard of
that one before.'”
― George Brett
People have been espousing the idea that statisticians have no role in
analytics and data science. Almost all of what I do in predictive analytics
is statistical in nature. One of the most important statistical procedures is
design of experiments, both for collecting the data and for performing
the analysis. Statisticians are very good at this.
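The simplest designed experiment, a full factorial, enumerates every combination of factor levels before any data are collected. The factors and levels below are hypothetical:

```python
from itertools import product

# Hypothetical factors for an experiment; a full-factorial design
# runs every combination of their levels.
factors = {
    "temperature": ["low", "high"],
    "pressure": ["low", "high"],
    "catalyst": ["A", "B"],
}

runs = [dict(zip(factors, levels)) for levels in product(*factors.values())]
# 2 x 2 x 2 = 8 runs, each a complete treatment combination
```

Planning the runs this way, before collecting data, is the point of Fisher's remark below.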
“To consult the statistician after an experiment is finished is often
merely to ask him to conduct a post mortem examination. He can
perhaps say what the experiment died of.”
― Ronald Fisher
“Definition of a Statistician: A man who believes figures don't lie, but
admits that under analysis some of them won't stand up either.”
― Evan Esar
“The greatest value of a picture is when it forces us to notice what we
never expected to see.”
– John Tukey
Statistics is Obsolete
Experts often possess more data than judgment.
— Colin Powell
I am sure you have heard of BIG DATA, but have you heard of ALL DATA?
Probably not, because I think I may have just invented the term. Apparently
there are some major players who actually believe that when they have
"all of the data", they no longer have to take samples, and that statistics
only applies to samples. Since we do not need to take samples anymore,
we do not need statistics anymore. By the way, Statisticians, we do
not need you anymore. Go find a nice job as a greeter at Walmart, not
that there is anything wrong with Walmart greeting—I kind of like the
special touch.
First, the CEO’s, CTO’s, and other O’s that are espousing this view
absolutely have no earthly idea (or one from a parallel universe) of what
they are talking about. I would call them naïve, but they are really
somewhere between “so full of themselves” and just plain oblivious to
reality! I think their issue is their lack of understanding of statistics,
statistics anxiety, or a combination of the two. But, ALL DATA will take
care of that, since it will not require any statistics.
Second, ALL DATA does not equal PERFECT DATA. I deal with colossal
data sets. In my experience roughly 80% of the data is useless when
building a model of some phenomenon, like the propensity to buy
product X or the probability that the O-ring on component Y of the latest
space launch vehicle will fail. Moreover, regarding the 80% of useless
data, about half of that is really bad data. Of the twenty percent that is
usable, only about 10% of it actually contains strong predictors. And
this stuff is pretty easy to figure out using Weight of Evidence (WOE) and
Information Value (IV) algorithms. But never fear, ALL DATA has
morphed into ALL-PERFECT DATA.
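For readers unfamiliar with them, WOE and IV are straightforward to compute. The sketch below is one common formulation (the 0.5 smoothing constant is my own choice to avoid taking the log of zero on empty cells, not part of the standard definition).

```python
import math
from collections import defaultdict

def woe_iv(bins, target):
    """Weight of Evidence per bin and total Information Value for a
    binned predictor against a binary target (1 = event, 0 = non-event)."""
    events, non_events = defaultdict(int), defaultdict(int)
    for b, y in zip(bins, target):
        (events if y else non_events)[b] += 1
    total_e = sum(events.values())
    total_n = sum(non_events.values())
    woe, iv = {}, 0.0
    for b in set(events) | set(non_events):
        pe = (events[b] + 0.5) / (total_e + 0.5)      # share of events in bin
        pn = (non_events[b] + 0.5) / (total_n + 0.5)  # share of non-events in bin
        woe[b] = math.log(pe / pn)
        iv += (pe - pn) * woe[b]
    return woe, iv
```

A predictor with IV near zero separates events from non-events hardly at all, which is how the small fraction of strong predictors gets identified quickly.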
Third, I think it is great that we have ALL-PERFECT DATA, but how are
you going to process it? I know, you’ll just push a button and get a prediction.
And inside that “black box” there are no statistics involved. There is no
regressing, no averaging, no transforming, no imputing, and so on. The
data is just speaking for itself. Oh wait, we have gone from ALL-PERFECT
DATA to something much better, TALKING DATA.
One person's data is another person's noise.
― K.C. Cole
Fifth, I do not really have a fifth so I am going to ask my BIG, ALL,
PERFECT, TALKING, MAGICAL DATA for its opinion. What was that? Our
UNICORN said, "all those people who think we do not need statistics are
not learned!" Now, do not get angry with me. I did not say that. It came
from the UNICORN you created…
That is an easy one: fire all the brainless CEO’s, CTO’s and other O’s who
wouldn’t know a statistic if it slapped them in the face. But that would
not work. Who would create the UNICORN?
Math, Physics and Chemistry are
Obsolete
I mentioned in “Statistics is Obsolete” that there are some pretty
influential people who reject statistics as the underlying science of data
analysis with respect to big data, or in their words ALL DATA (specifically,
“We have all of the data.”). I think the essence of the issue for those
people is that they simply do not understand statistics, or for that matter any
science that has “formulas”, like math, physics and chemistry.
The ALL DATA proponents seem to cluster with those who have softer
quantitative degrees or degrees in management, psychology, and so on.
Of course, cluster analysis is obsolete, as are principal component
analysis, regression analysis and so on (and probably any kind of analysis
that involves using the scientific method). So just ignore the clustering.
If the ALL DATA people prevail, then I suppose Math, Physics and
Chemistry are obsolete, as well. Therefore, let’s be proactive and shut
down all the mathematics, physics and chemistry programs. All you
math, physics and chemistry majors, you should switch majors to
anything that does not require mathematics. All you tenured professors
out there, well, you just lost your tenure, but I hear McDonald’s is
hiring.
Parents, stop teaching your kids how to count. I know it’s cute, but it’s
also futile in light of ALL DATA. We do not need anything numerical
anymore. In fact, we do not need anyone to think anymore. These
absolutely brilliant ALL DATA supporters and their THINKING DATA will
think for us (they are not really brilliant, just clever).
If you exceed these requirements, then you are over-qualified. After all,
with the possession of ALL DATA you just need a chimpanzee to push a
button on a black box and predict anything you care to predict, because
it is in the data.
Part V – Data Science Concerns
Call it what it is
I think that life would be better in all things data if we just called things
what they are, like calling mathematicians by their function: number
theorist, math teacher, mathematical modeler, algebraist, etc. How
about datatician? Will that cause confusion with statistician?
Mimic Mathematical Sciences?
Maybe “data science” is okay and the problem is “data scientist”. We
have “mathematical sciences” (note the plural), but few mathematicians
would call themselves a “mathematical scientist”, even though that
sounds pretty cool. The problem is it doesn’t describe what one does in
the field of mathematics. Algebraist, however, says that one works in the
mathematical area of algebraic structures. The mathematical sciences
might look like this (I borrowed these from the National Science
Foundation):
Algebra
Number Theory
Analysis
Applied Mathematics
Combinatorics
Computational Mathematics
Foundations
Geometric Analysis
Mathematical Biology
Probability
Statistics
Topology
A Data Science Taxonomy?
Data architect
Data processor
Data miner
Data modeler
Data analyst
Data explorer
Database administrator
Database developer
Algorithm developer
etc.
Machine learning architect
Neural networker
Classification tree farmer
Random forest ranger
(now I am just being silly)
Conclusion
Somehow, “data scientist” does not cut it for me, and there are many in
the industry and the field who have similar issues. I have read and been
told that many organizations will no longer hire data scientists, partly
because there is no standard for what a data scientist is or does. I do not
think we will reach a standard, so let’s label a person who works with
data by what they actually do. Then hire them accordingly.
So you think you are a Data
Scientist?
Many are writing articles about the skills required to be a data scientist.
I disagree with most of them. Before you can be a data scientist, you
must qualify to be a scientist. A degree in management does not qualify
you to be a Chemical Scientist or Mathematical Scientist, nor does it
qualify you to be a Data Scientist. A degree in Management Science or
Operations Research could qualify you, depending on the course
content.
This guy might qualify to be a data scientist
not qualified to be an Assistant Scientist. Though I may be qualified to
be a Principal Data Scientist, I do not use the title. Instead, I use Big Data
Analytics Consultant, being careful not to misrepresent myself.
Scientist I/II
MS (or BS plus three years of experience) in mathematics,
science, engineering or equivalent
Two to five years research experience or equivalent
course/project work
Demonstrated object-oriented analysis and design skills
Knowledge of and experience with a variety of software
languages and platforms
Proven research skills
Proven ability to understand appropriate algorithms and
concepts
Demonstrated ability to employ a hierarchical writing process to
generate clear, grammatically correct documents
Demonstrated ability to provide technical guidance to interns
and other more junior team members
Strong analytical skills
Strong fit with Company goals and values (work ethic, attitude,
interest in work)
5 Signs that you might be a Data
Scientist
Why are people so down on Data Scientists? Probably because they
don't know what one looks like. There are a lot of folks out there calling
themselves Data Scientists just because they had some introductory
statistics courses, and maybe they can make really cool charts in Excel. All
that makes you is someone who had some introductory statistics courses
and can make some really cool charts in Excel. So what might a real Data
Scientist look like? Here are some signs that you might "really" be a Data
Scientist.
2. You have a graduate degree in a science or related field, e.g.,
Analytics, Predictive Modeling, Risk Analysis, Applied Statistics,
Database Architecture, etc. Most occupations where the title
“Scientist” is used require a doctoral degree. I am taking the
liberty of softening the requirement to a Master of Science degree.
Sorry, an MBA does not make the cut. Alternatively, you hold a
professional certification like Certified Analytics Professional (CAP).
3. See 1 and 2. Moreover, you include some or all of the following skills
in your profile: Data Scientist, Analytics Scientist, SPSS Modeler, SAS
Programming, Regression Analysis, Statistical Modeling, Logistic
Regression, Machine Learning, SQL, Python, etc.
4. See 1 and 2. Additionally, your boss or customer listens to you and
takes action when you present your analysis and recommendations.
Moreover, profit was made or money was saved (or for non-profits
some measure of performance was achieved) based on your work.
5. See 1 and 2. Not only can you crunch numbers with the best of them,
you can logically present the results of your analysis in plain
language, or at least your customer’s language. Additionally, you
make contributions to the body of knowledge through scientific
experimentation or studies, resulting in blogs, articles, papers, or
presentations.
The real scientists worked very hard to become one; they did not just
attach a label to their title. That does not mean that only scientists work
with data. Besides data scientists, other people who work with data may
include data analysts, data consultants, data architects, data miners,
analytics consultants, and so on. However, to be a data scientist you
need the prerequisites outlined in 1 and 2. For hiring managers, project
managers, human resources professionals, etc., when you hire a data
scientist, make sure you actually hire a scientist who works with data.
Otherwise, do not call them "scientists". You might be less disappointed
in their work.
How can I be a Data Scientist?
Several aspiring data scientists and analytics scientists have asked me how
to achieve their knowledge and training goals. Though there are many
ways to accomplish this, below are two options offered through
Coursera.
Cluster Analysis in Data Mining
Text Mining and Analytics
Data Visualization
Data Mining Capstone
If you need SAS for training or self-study, you can get the university
edition through the link below. You do not have to be a student or
faculty member, but you cannot use it for work.
http://www.sas.com/en_us/software/university-edition.html
http://hortonworks.com/products/hortonworks-sandbox/#install
Why you might not want to be a Data
Scientist
Data Scientist was one of the world’s most popular technical jobs, at
least in 2014. From what I have been told, among the major countries, it
ranked second only in the United States. Universities everywhere are
creating, or have already created, graduate programs in Data Science.
Data is becoming more and more available and in large quantities.
Recently, President Obama appointed a US Chief Data Scientist (Dr. DJ
Patil). Data science is at center stage!
So, why would you not want to be a data scientist? The following reflects
my opinion only, and should not be construed as advice, merely
something to consider.
Saturation
We are putting more and more data scientists into the field every day.
Personally, I am asked several times a month to consider filling a data
scientist role here or there. Today, there are ample opportunities for
data scientists. It may be different tomorrow. If you are thinking about
becoming a data scientist in some way, shape or form, especially if you
have just started college or will start soon, consider the idea that the job
market may be saturated by the time you are qualified. Ask yourself if
there are alternative courses of study and courses of action.
Data without a human touch? I do not know, nor does my crystal ball
show me, what impact the Internet of Things (IoT) could have, but it is
something to consider.
Imposters
I am sure I will get in trouble here, but it would seem to me that before
you can be a data scientist, you have to first be a Scientist. The terms
Scientist, Engineer, Analyst, and so on, have specific meanings. For
instance, the General Services Administration defines a scientist as
a subject matter expert who:
Applies principles, methods and knowledge of the functional
area of capability to specific task order requirements, advanced
mathematical principles and methods to exceptionally difficult
and narrowly defined technical problems in engineering and
other scientific applications to arrive at automated solutions.
Data Architect
Data Communication Manager
Data Warehousing Administrator
Data Warehousing Analyst
Data Warehousing Programmer
Data/Configuration Management Specialist
Database Analyst/Programmer
Database Manager/Administrator
If you are a hiring manager, watch out for imposters. If you are an
aspiring data scientist, make sure you are qualified before you make
your claim. Imposters have the potential for damaging the reputation
and expectations of the occupation.
Alternatives
There are really some cool alternatives to becoming a data scientist and
still being involved with data science. I do not call myself a data scientist.
Instead, I claim to be a "predictive analytics scientist". However, by prior
training and occupation, I am really an Operations Research Analyst (or
Scientist). Operations Research (OR) has been around since World War
II and OR analysts did things like plan the Normandy invasion (Operation
Overlord). The International Forum for Operations Research and
Management Science (INFORMS) now sponsors the Certified Analytics
Professional (CAP) program. Many universities offer degree programs or
concentrations in OR. ORs work for cruise lines, airlines, the
entertainment industry, the wholesale and retail sales industries, the
trucking and train industries, the automotive industry, manufacturing
industry, and the departments of Defense, Transportation and others. In
addition to data "stuff", they perform optimization, predictive
modeling, inventory control, and simulation.
I am an Analyst, really!
Many people say they are analysts, as I do. But are we really analysts?
Do we perform analysis? That is, do we take problems and break them
into smaller pieces to solve, and then develop an aggregate solution?
Analyst is a term for someone who uses an analytic approach to problem
solving. The opposite problem solving approach is intuition.
I have often written about data scientists and what actions and skills
they manifest most. However, not all Data Scientists are analysts--they
may not be engaged in problem solving. At any rate, it should be clear
that Analytics is strongly correlated to the problem solving process.
So, who are the real analysts, the ones we may be looking to hire, who
approach problem-solving by disaggregating problems into their smaller
parts? These are people who have specific training and experience,
either qualitative or quantitative, in the science and logic of a field that
many are calling Analytics.
There are, however, intuitive problem solvers who call themselves
analysts, and that may not necessarily be the case.
There is a need for intuitive problem solving, but not in Analytics.
You could face a situation where you have a sociologist and a statistician
both claiming to be analysts. Suppose the latter routinely characterizes
data and then passes that information to someone else for problem
solving. We could argue that the statistician is not an analyst.
All Things Data
Data Scientist, Analytic Scientist, Statistician, Operation Research
Analyst, Predictive Modeler, … They are all very different… or are they?
Ask not what your data can do for you; ask what you can do with your
data. After all, to turn that data into useful information, you have to
scientisize it, or statisticize it, or researchacize it, or modelize it (all of
these are new words now). You might even have to analyze it, or
synthesize it, perform some other –izes upon it. And is that not what
statisticians, modelers, analysts, and scientists who work with data do from
day to day?
So, instead of “how to become a data scientist” and “statisticians are not
data scientists”, why not “we are all a bunch of professionals trying to
turn tons of data into useful, applicable and actionable information and
this is how you can join us”?
Like Alan Turing, who had to get his machine to determine millions of
different combinations of Enigma settings to help in the fight against
Nazi Germany, so we too have to sift through millions of combinations
of variables to find useful information that will aid in our fight against
stupidity. That’s right, we have been stupid for too long. Regardless of
what CNN or FOX News tells you, we are not doing just fine. Look around
your neighborhood, or mine. We need to set aside our differences and
join our collective intellects to make this enormous amount of data that
is available to us, work for us and make us better tomorrow than we are
today.
If you are expecting our new Chief Data Scientist to solve all of our ills,
don’t hold your breath. The Department of Education didn’t work, other
than give jobs to people who have very little talent for getting things
done; the House and the Senate aren’t doing anything but fighting a
standoff and keeping the status quo; the IRS isn’t doing anything other
than taking your hard-earned dollars and redistributing them. Let’s
just fire them all and give Dr. Patil an army of “professionals who do
great stuff with data.” Well, that’s not going to happen, but still the
answer may be in All Things Data.
Other things I have read indicate that data scientists are carrying a
blemish which causes them not to be needed. No wonder! It appears as
if I give $10 to Vincent ‘Vinnie’ Antonelli [Steve Martin in My Blue
Heaven], then he can make me a data scientist. I have talked to
recruiters who have been disappointed in the actual qualifications of
some who label themselves a data scientist.
Are these people real? Two college courses plus R and Python? Although
my view may not be popular, a monkey can learn to use tools. Tools do
not make you anything other than a tool user. A deeper understanding
of data structure, data analysis, data modeling, data … is what one needs
to be Data Scientist. Conceptual understanding, not tools. If you have
the former, you can pick up the latter in no time.
Then what makes a data scientist? I would say that coursework leading
to a degree, conceptual understanding, a sprinkle of research, a dash of
new development, an occasional bath, plus “effectiveness” makes a data
scientist. When a client uses the product produced by the data scientist,
whether it be a database, a data architecture, a data model, or data
analysis, then we have “effectiveness”, and the person who did all of
this is a data scientist.
If you are a pilot of a Boeing 757 and you cannot land the aircraft (it may
be a skill you never acquired), then you are not a pilot. You are just an
airline employee with wings on your chest. Landing the plane is pretty
important, and “landing” your solution with the big data customer is
pretty important. When you talk as a data scientist, then clients should
listen and take action based on what you tell them.
Why your client might not be
listening
What is the Problem?
To effectively communicate, we must realize that we are all different in
the way we perceive the world and use this understanding as a guide to
our communication with others.
—Anthony Robbins
Have you ever worked your fingers to the bone developing a model or
other analytic solution to a problem for your client or department, and
then observed that your solution was never used? Is the boss or
customer listening to your advice? If the answers are Yes and No,
respectively, then there is a problem, but it is a common problem. There
are a number of root causes for this; the most common is
communication.
Client Interaction
There is a group on YouTube called Studio C (If you have not watched
them before, you are missing out on some good comedy). I watched one
yesterday called Poke Face,
https://www.youtube.com/watch?v=XQ6_GdODuww. Each player is
showing their poker face and we get to hear each one of their thoughts.
One guy does not know the game at all—not the rules, the poker chips,
etc.—and when it is time to show his hand, he is showing UNO cards.
However, he kept his poker face the whole time. This is your client. Just
because you are an outstanding analyst, providing excellent analytic
solutions, does not guarantee that people will listen to you. You have to
speak their language.
Language is Important
Don't use words too big for the subject. Don't say infinitely when you
mean very; otherwise you'll have no word left when you want to talk
about something really infinite.
― C.S. Lewis
You have to speak his language, understand his culture, and recognize
his belief system.
What follows is a real example, though one in which the client uses
their jargon with the analyst, rather than the other way around.
Recently, a colleague asked for some advice for an upcoming interview.
The telephone interview was with the CTO and this is what was taken
away from that interview. In answering what kind of projects would be
expected, the CTO said that the following could be expected
1. Message testing
2. Spent-time model of customers that are at risk and at growth
3. Atomization of models/algorithms/etc.
I build a model in SAS Enterprise Miner, I take the score code
and wrap it in an execution macro and then give that code to a
guy who deploys it for production on a mainframe. The same
guy runs the model each month along with about 20 other
models during the first week of the month. The process is
automated and the scores are sent to a marketing directory for
access. He also runs a model performance report on each model,
which is partially automated. An emerging process is real-time
modeling. The same models would be used, but they might be
written in Python in a Hadoop environment and the data would
be refreshed at regular, short time intervals. This could be used to
direct traffic to a call center. For example, if a caller has a high
score for propensity to shed a mortgage, that customer could be
directed to the mortgage company and provided some incentive
not to shed their mortgage.
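The real-time routing idea in the quote above can be sketched as a simple threshold rule. This is only an illustration, not the client's actual system; the score name, threshold, and queue labels are invented:

```python
# Toy sketch of score-based call routing: a caller with a high
# propensity-to-shed-a-mortgage score is sent to a retention queue.
# The threshold and names below are hypothetical.
ATTRITION_THRESHOLD = 0.7

def route_call(caller_scores):
    """Pick a queue from a dict of model scores for one caller."""
    if caller_scores.get("mortgage_attrition", 0.0) >= ATTRITION_THRESHOLD:
        return "mortgage_retention_queue"
    return "general_queue"

print(route_call({"mortgage_attrition": 0.85}))  # mortgage_retention_queue
print(route_call({"mortgage_attrition": 0.30}))  # general_queue
```

In practice the scores would come from the deployed models described above, refreshed on a short cycle.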
Conclusion
The ability to simplify means to eliminate the unnecessary so that the
necessary may speak.
—Hans Hofmann
Part VI - Applications
Big data also has attracted the attention of enterprise managers and
their human resource (HR) managers. Many believe that they can now
analyze mountains of structured and unstructured data to answer
important questions regarding workforce productivity, the impact of
training programs on enterprise performance, predictors of workforce
attrition, and how to identify potential leaders. For the purpose of this
article, HR will refer to the enterprise function and human resource will
refer to the humans who are resources for the enterprise.
Hogwash!
The HUMAN in HR barely exists as it is. Now I hear that HR is going to
identify leaders through big data analytics (I am trying not to laugh too
hard). You would think that one would need to know what defines a
leader, but with all the “junk” articles I read on leadership, most people
including HR do not know what leadership is! Many use management
and leadership synonymously, and they are mistaken.
Root Causes
I am not picking on HR. This is a common problem across disciplines, and
I know many competent, caring and professional HR personnel. Having
stated that, I often use a quote from Kurt Vonnegut, author of Player
Piano:
“Almost nobody’s competent, Paul. It’s enough to make you cry to see
how bad most people are at their jobs. If you can do a half-assed job of
anything, you’re a one-eyed man in the kingdom of the blind.”
Analytics also does not make bad products better, poor services better,
and so on. It may provide insight into why products and services are
perceived as being poor, but the only way to fix these problems is to
“make a better product” and “provide better services”.
In spite of the fact that I have been downplaying the need for analytics
in HR, when we know these kinds of things, we can actually put
the HUMAN back in HR. I know, it is a paradox. The point is do not use
big data analytics to fix HR, rather, use it to enhance HR functionality. If
an enterprise feels it needs to fix its HR, then it should fix its HR before
adding big data analytics to the sauce. If HR is broken, adding analytics
is just going to produce more of the same sauce, and you will not like its
flavor.
Conclusion
HR is an important function for the enterprise, unless you replace
humans completely with machines, robots and androids. Improving the
enterprise’s HR function is a worthy undertaking and big data analytics
may help improve its functionality, but only if it is already functional!
Call Center Analytics: What's
Missing?
What is the Problem?
My customer frequently asks if this model or that model can be used to
direct call center traffic. Usually the models are projecting an acquisition
or engagement window with too much of a gap to perform this function,
and this is hardly part of their use case as acquisition models. However,
I can conceive of a model that does perform this function. Yet a model is
not entirely what they need.
A model might be the answer they are looking for. However, you cannot
just build a model and leap into its use blindly, without testing. The best
way to test in this situation is using discrete event simulation.
What is Discrete Event Simulation?
Discrete Event Simulation (DES) is the process of codifying the behavior
of a complex system as an ordered sequence of well-defined events. In
this context, an event comprises a specific change in the system's state
at a specific point in time (arrival at the bank, service by a teller, etc.).
Rather than stepping based on a time increment, like every second, DES
advances based on events—events that may or may not be equally
spaced in time.
Predetermined starting and ending points, which can be discrete
events or instants in time (arrivals and departures, for instance).
A method of keeping track of the time that has elapsed since the
process began (waiting time, for instance).
A list of discrete events that have occurred since the process
began (begin service, for instance).
A list of discrete events pending or expected (if such events are
known) until the process is expected to end.
A graphical, statistical, or tabular record of the function for
which DES is currently engaged (plot of waiting times and service
times, for instance).
Using DES, we can test the model with multiple parameters, or test
multiple models, without disrupting the operations of a call center. Once
we determine which model (or which parameters) works the best, we can
then use that particular model to direct calls, for instance. Continued
monitoring and use of the DES for simultaneous parallel testing can be
used to maintain and improve efficiency.
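The event mechanics described above can be sketched as a minimal single-server queue (one call-center agent). This is only a toy sketch; the arrival times and service time are invented numbers chosen to show how events, rather than fixed time steps, drive the simulation:

```python
import heapq

# Minimal discrete-event simulation of a single-server queue.
# Arrival and service times are made-up illustration values.
arrivals = [0.0, 1.0, 2.5]   # caller arrival times (minutes)
service_time = 2.0           # each call takes 2 minutes

events = [(t, "arrival", i) for i, t in enumerate(arrivals)]
heapq.heapify(events)        # future-event list, ordered by time

server_free_at = 0.0
waits = {}

while events:
    time, kind, caller = heapq.heappop(events)  # advance to next event
    if kind == "arrival":
        start = max(time, server_free_at)       # wait if agent is busy
        waits[caller] = start - time            # elapsed waiting time
        server_free_at = start + service_time
        heapq.heappush(events, (start + service_time, "departure", caller))
    # departures need no further action in this tiny model

print(waits)  # waiting time per caller
```

Swapping in different service times, or scores from competing models, lets us compare configurations without touching the live call center.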
Win, Lose or Draw…Is this a Game?
What economists call game theory psychologists call the theory of social
situations, which is an accurate description of what game theory is
about. Although game theory is relevant to parlor games such as poker
or bridge, most research in game theory focuses on how groups of
people interact. There are two main branches of game theory:
cooperative and non-cooperative game theory. Non-cooperative game
theory deals largely with how intelligent individuals interact with one
another in an effort to achieve their own goals. That is the branch of
game theory I will discuss here.
Decision Theory
Decision theory can be viewed as a theory of one person games, or a
game of a single player against nature. The focus is on preferences and
the formation of beliefs. The most widely used form of decision theory
argues that preferences among risky alternatives can be described by
the maximization of the expected value of a numerical utility function,
where utility may depend on a number of things, but in situations of
interest to economists often depends on money income. Probability
theory is heavily used in order to represent the uncertainty of outcomes,
and Bayes Law is frequently used to model the way in which new
information is used to revise beliefs. Decision theory is often used in the
form of decision analysis, which shows how best to acquire information
before making a decision.
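As a toy illustration of the expected-utility rule just described, the alternatives, probabilities, and utilities below are invented for the example:

```python
# Toy expected-utility calculation: choose the risky alternative
# whose probability-weighted utility is highest.
alternatives = {
    # name: list of (probability, utility) outcome pairs
    "safe_bond":   [(1.0, 50)],
    "risky_stock": [(0.6, 100), (0.4, -20)],
}

def expected_utility(outcomes):
    return sum(p * u for p, u in outcomes)

best = max(alternatives, key=lambda a: expected_utility(alternatives[a]))
for name, outcomes in alternatives.items():
    print(name, expected_utility(outcomes))
print("choose:", best)
```

Here the risky stock (expected utility 52) edges out the safe bond (50), even though it can lose money; a different utility function would encode more risk aversion.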
Example
One way to describe a game is by listing the players (or individuals)
participating in the game, and for each player, listing the alternative
choices (called actions or strategies) available to that player. In the case
of a two-player game, the actions of the first player form the rows, and
the actions of the second player the columns, of a matrix. The entries in
the matrix are two numbers representing the utility or payoff to the first
and second player respectively.
The Prisoner’s Dilemma (PD): The problem can be traced back to von
Neumann and Morgenstern [von Neumann, 1944] and, of course, John
Nash [Nash, 1953], you and a friend have committed a crime and have
been caught. You are being held in separate cells so that you cannot
communicate with each other. You are both offered a deal by the police
and you have to decide what to do independently. Essentially the deal is
this.
If you confess and your partner denies taking part in the crime,
you go free and your partner goes to prison for ten years.
If your partner confesses and you deny participating in the
crime, you go to prison for ten years and your partner goes free.
If you both confess you will serve six years each.
If you both deny taking part in the crime, you both go to prison
for six months.
What will you do? The game can be represented by the following matrix
of payoffs.
Note that higher numbers are better (more utility). If neither of you
confesses, you both serve only six months and split the proceeds of your
crime, which we represent by 5 units of utility for each suspect. However, if one of
you confesses and the other does not, the one who confesses testifies
against the other in exchange for going free and gets the entire 10 units
of utility, while the one who did not confess goes to prison, which
results in the low utility of -4. If both of you confess, then both are given
a reduced term, but both are convicted, which we represent by giving
each 1 unit of utility: this is called mutual defection.
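The payoff matrix just described can be written out and checked directly. A minimal sketch using the utilities given above (5 each for mutual denial, 10 vs. -4 when one confesses, 1 each for mutual defection):

```python
# Prisoner's Dilemma payoff matrix from the text:
# key = (player 1's action, player 2's action),
# value = (utility to player 1, utility to player 2).
payoff = {
    ("deny",    "deny"):    (5, 5),
    ("deny",    "confess"): (-4, 10),
    ("confess", "deny"):    (10, -4),
    ("confess", "confess"): (1, 1),
}

def best_response(opponent_action):
    """Player 1's best action given what player 2 does."""
    return max(["deny", "confess"],
               key=lambda a: payoff[(a, opponent_action)][0])

# Confessing is best no matter what the other player does
# (a dominant strategy), which is why mutual defection results.
print(best_response("deny"))     # confess: 10 > 5
print(best_response("confess"))  # confess: 1 > -4
```

By symmetry the same holds for player 2, so (confess, confess) is the equilibrium even though (deny, deny) would leave both better off.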
This game has fascinated game theorists for a variety of reasons. First, it
is a simple representation of a variety of important situations. For
example, instead of confess/not confess we could label the strategies
“contribute to the common good” or “behave selfishly.” This captures a
variety of situations economists describe as public goods problems. An
example is the construction of a bridge. It is best for everyone if the
bridge is built, but best for each individual if someone else builds the
bridge. This is sometimes referred to in economics as an externality.
Similarly this game could describe the alternative of two firms
competing in the same market, and instead of confess/not confess we
could label the strategies “set a high price” and “set a low price.”
Naturally it is best for both firms if they both set high prices, but best for
each individual firm to set a low price while the opposition sets a high
price.
References
Nash, J. Two-Person Cooperative Games. Econometrica 21: 128-140,
1953.
von Neumann, J. and Morgenstern, O. Theory of Games and Economic
Behavior. Princeton University Press, 1944.
Part VII – The Power of
Operations Research
Experimental and Engineering Design
Manufacturing and Production
Logistics and Transportation
Supply Chain Management
Enterprise Resource Planning
Communications
Interfaces
Networks
Scheduling
Routing
Manpower
Modeling
Mathematical Programming
Computing Technology
Probability and Statistics
Stochastic Simulation
Systems Analysis
Organization Theory
Accounting Principles
Engineering Economics
Decision Analysis
Game Theory
Heuristics
Computer Programming
Numeric Methods
Stochastic Analysis
Queuing Theory
Evolutionary Algorithms
Dynamic Programming
Mathematical Theory
Statistical Theory
Computing Theory
Economic Theory
What is their history?
Operational Research was born during the early years of WWII and
matured rapidly. One of its primary functions was the planning of
Operation Overlord or the Normandy Invasion. It has its foundations in
mathematics, computing and economic theories, on which basic tools in
optimization and simulation are built. Today ORs are employed by
airlines, train lines, logistic systems, delivery systems (e.g., FedEx),
defense systems, military, oil companies, insurance companies, financial
institutions, manufacturing, marketing and many more.
Getting the Question Right
As an analyst and modeler I have a variety of customers with a variety
of problems in business. We usually frame their “question” as a business
case. The question of course is something they require an answer for; it
usually stems from a problem or a desire to have more Share of Wallet
or other metric. What is interesting here is though they know they have
a question that needs to be answered, they often do not know how to
state their question.
So, this idea of what is the real question is not new and it is not unique
to business.
As an analyst, we have to draw that real question out through dialogue
with our customers. Doing this is often as much an art as it is a science.
If we do not do it, we stand a good chance of delivering a well-built solution
that happens to be the wrong one. I have done this at least twice in the
last three years, knowing that getting the question right was paramount.
In spite of my failures I have found some keys to help with formulating
the business case.
I am sure there is a lot more that can be said about this subject, but this
is what I have learned over time and I am just one analyst. The fact
remains that if we do not get the “real question”, then we may not
provide the “right solution.”
There are other interesting pieces of Roske’s Recipe I did not address
here. Perhaps another article?
Reference
Roske, V. P., Jr., “Making Analysis Relevant,” PHALANX, Volume 31,
Number 1, March 1998.
Holistic Analysis and Operations
Research
Introduction
I am not really sure what holistic analysis is, so I will define it. Our English
word comes from the Greek ὅλος (holos, meaning “all, whole or entire”).
Reductionism may be viewed as the complement of holism.
Reductionism analyzes a complex system by subdividing or reduction to
more fundamental parts. For businesses, knowledge and know-how,
know-who, know-what and know-why are part of the whole business
economics. Having a holistic view keeps us from missing the forest due
to the trees.
Operations Research (OR), or operational research in the U.K., is a
discipline that deals with the application of advanced analytical methods
to help make better decisions. The terms management science and
analytics are sometimes used as synonyms for operations research. Yet,
in my experience OR extends far beyond either. The list below is a
collection of operations research activities – I’ll let you decide if they are
also performed in analytics.
airlines, train lines, logistic systems, delivery systems (e.g., FedEx),
defense systems, military, oil companies, insurance companies, financial
institutions, manufacturing, marketing and many more.
An OR’s view of the problem space is really what defines them and
describes what they do. The list above displayed some of the activities
that ORs engage in, but not without a holistic view of the problem space.
Figure 1 depicts the entire problem space. Mathematically, we could
look at it like this:
Analysis Space ⊂ Research Space ⊂ Operations Space ⊂ Problem Space
The OR Analyst must enter the problem space with the following in
mind: (1) the potential operational domains, (2) the types of research
that may be used, and (3) the types of analyses that may be appropriate.
If one goes in having done nothing more than math programming for 10
years, that analyst is NOT an operations research analyst—they are just
a math programmer.
Part VIII - Tools
"Over the years, respondents to the 2007-2011 Data Miner Surveys have
shown increasing use of R. In the 5th Annual Survey (2011) we asked R
users to tell us more about their use of R. The question asked, “If you
use R, please tell us more about your use of R. For example, tell us why
you have chosen to use R, why you use the R interface you identified in
the previous question, the pros and cons of R, or tell us how you use R
in conjunction with other tools.” 225 R users shared information about
their use of R. They provided an enormous wealth of useful and detailed
information. Below are the verbatim comments they shared."
Furthermore, your scripts can tend to be a bit messy if you are not
sure what kind of analysis or models you are going to use.
I use R for the diversity of its algorithms, packages. I use Emacs for
other tasks and it’s a natural to use it to run R, and Splus for
that matter. I usually do data preparation in Splus, if the technique
I want to use is available in Splus I will do all the analysis in
Splus. Otherwise I’ll export the data to R, do the analysis in R, export
results to Splus where I’ll prepare tables and graphs for
presentations of the model(s). The main drawback to R, in my
opinion, is that R loads in live memory all the work space it is linked
to which is a big waste of time and memory and makes it difficult to
use R in a multi-users environment where typical projects consist of
several very large data sets.
We continue to evaluate R. As yet it doesn’t offer the ease of use
and ability to deploy models that are required for use by our
internationally distributed modeling team. “System” maintenance
of R is too high a requirement at the moment and the enormous
flexibility and range of tools it offers is offset by data handling
limitations (on 32 bit systems) and difficulty of standardizing quick
deployment solutions into our environment. But we expect to
continue evaluation and training on R and other open source
tools. We do, for instance, make extensive use of open source ETL
tools.
I use R extensively for a variety of tasks and I find the R GUI the most
flexible way to use it. On occasion I’ve used JGR and Deducer, but
I’ve generally found it more convenient to use the GUI. R’s strengths
are its support network and the range of packages available for it
and its weaknesses are its ability to handle very large datasets and,
on occasion, its speed. More recently, with large or broad datasets
I’ve been using tools such as Tiberius or Eureqa to identify important
variables and then building models in based on the identified
variables.
I’m using R in order to know why R is “buzzing” in analytical areas
and to discover some new algorithms. R has many problems with
big data, and I don’t really believe that Revolution can effectively
support that. R language is not mature for production, but really
efficient for research: for my personal researches, I also use SAS/IML
programming (which is for me the real equivalent for R, not
SAS/STAT). I’m not against R, it’s a perfect tool to learn statistics,
but I’m not really for data mining: don’t forget that many techniques
used in data mining comes from Operational Research, in
convergence with statistics. Good language, but not really
conceived for professional efficiency.
R is used for the whole data loading process (importing, cleaning,
profiling, data preparation), the model building as well as for
creating graphical results through other technologies like
Python. We use it also using the PL/R procedural language to do in-
database analytics & plotting.
Utilize R heavily for survey research sampling and analysis and
political data mining. The R TextMate bundle is fantastic although
RStudio is quickly becoming a favorite as well. Use heavily in
conjunction with MySQL databases.
I use R in conjunction with Matlab mostly, programming my
personalized algorithms in Matlab and using R for running statistical
test, ROC curves, and other simple statistical models. I do this since
I feel more comfortable with this setting. R is a very good tool for
statistical analysis basically because there are many packages
covering most of statistical activities, but still I find Matlab more easy
to code in.
I greatly prefer to use R and do use it when working on more
“research” type projects versus models that will be reviewed and
put into production. Because of the type of work that I do, SAS is the
main software that everyone is familiar with and the most popular
one that has a strong license. Our organization needs to be able to
share across departments and with vendors and governmental
divisions. We can’t use R as much as I would like because of how
open the software is – good for sharing code, bad for ensuring
regulators that data is safe.
Main reasons for use: 1) Strong and flexible programming language,
making it very flexible. 2) No cost, allowing me to also having it on
my personal computer so that I can test things at home that I later
use at work. I use RODBC to get the data from server and let the
server do some of the data manipulation, but control it from within
R. Have also started to use RExcel with the goal as using that as a
method to deploy models to analysts more familiar with Excel than
R.
Personally I find R easier to use than SAS, mostly because I am not
constrained in getting where I want to go. SAS has a
canned approach. I see using GUI’s as a “sign of weakness” and as
preventing understanding the language at its core. I have not found
Rattle to be particularly helpful. I have also tried JGR and Sciviews
and found I could not surmount the installation learning curve. Their
documentation did not produce a working environment for me.
Why You Might Use SAS
I write a good bit of content about using open-source tools for analytics
and operations research. However, my workhorse happens to be SAS.
Actually, I use SAS Enterprise Guide (EG) and SAS Enterprise Miner (EM).
data imputations, and data transformations prior to running a model,
like logistic regression. When I get an adequate model, I take the scoring
code from EM and port it to EG, wrapped in a macro. I run the macro in
EG to measure model performance. When I get a model that performs
well enough against challenger models, I develop model production
code in EG, which includes the EM scoring code.
Conclusion
I will not argue that either SAS or open-source tools are better than the
other. Instead I will state that they each have their use in various
situations. Yet, SAS is tried and true…and I will continue to use both.
What is BOARD BEAM?
Is it a beam supporting a roof? Is it a board made from Canadian timber?
Is it capable of leaping a tall building in a single bound? No, it is a tool
for performing analytics! BOARD is the company and software platform
and BEAM is a software module: BOARD Enterprise Analytics Modeling
(BEAM).
Disclaimer
I am not a spokesperson, customer, or user of BOARD BEAM, but I think
that it is worthy enough of our attention to write about it here.
What is it?
You could probably compare BOARD BEAM most closely with Tableau,
but I will not do so here. I'll just tell you what I have seen with BOARD
BEAM. At first glance, it integrates advanced and predictive analytics
with business intelligence and performance management. Its
capabilities range from analytics reporting to predictive analytics using
time series models. It is also easy to use. I have spoken with project
managers who use it for powerful business insights and take action, and
data analysts who use it for reporting medical procedure information in
healthcare to insurance providers.
You can automatically group your customers, products etc. into clusters
and immediately use them as analysis dimensions in your Business
Intelligence environment. Using k-means as its basis, clustering can be
performed with a few clicks of the mouse. There is no coding involved,
and if you can spell "K-means" correctly, you can probably perform it as
well.
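BOARD's implementation is its own, but the k-means idea underneath can be sketched in a few lines. The customer spend figures and starting centers below are invented, and a production version would also guard against empty clusters:

```python
# Bare-bones k-means sketch (k = 2) on one-dimensional customer
# spend figures -- the textbook algorithm, not BOARD's code.
spend = [10, 12, 11, 95, 100, 98]
centers = [10.0, 100.0]   # fixed starting centers, for determinism

for _ in range(10):       # a few refinement passes
    # assignment step: each point joins its nearest center
    clusters = [[], []]
    for x in spend:
        nearest = min(range(2), key=lambda k: abs(x - centers[k]))
        clusters[nearest].append(x)
    # update step: each center moves to its cluster's mean
    centers = [sum(c) / len(c) for c in clusters]

print(clusters)  # low-spend and high-spend customer groups
print(centers)   # final cluster means
```

Each resulting cluster label can then be used as an analysis dimension, which is essentially what the point-and-click clustering in BEAM exposes.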
How can I get it?
You can test-drive BOARD BEAM by visiting
http://www.board.com/us/.
R, SPSS Modeler and Half-Truths
The Ad
"Please join us on May 26th, 2015 at 9am PDT for our latest Data Science
Central Webinar Event: 7 Reasons to combine SPSS Statistics and R
sponsored by IBM."
The Claim
The ad goes on to claim: “According to the Rexer survey,* R is the
analytic software of choice for data scientists, business analysts, and
data miners in the corporate world. Despite R's popularity, adoption of
R has lagged due to a few limitations like:”
- Data Complexity - R does not easily connect to databases natively.
- Output - Production of publish-ready output is difficult.
- Performance & Scalability - R can very quickly consume all available memory.
- Collaboration - R makes sharing work among an analyst team difficult, especially when team members do not have the same level of R knowledge.
- Enterprise security - The security of the packages that you download is not assured.
The Survey
I posted partial results of the Rexer survey yesterday on bicorner.com and
today on LinkedIn. I have read the entire survey of about 225 responses.
I would not draw entirely the same conclusions, unless I worked for IBM.
IBM’s Motive?
It is of course in IBM’s best interest to draw unsupported conclusions to
support their product, SPSS Modeler. However, what they do not tell you
is that SPSS Modeler is not that user-friendly either. Modeler alone lacks
functionality that could be provided by SPSS Statistics, but you have to pay
extra for that. Consequently, building a model for measuring the
information value of variables is cumbersome in SPSS Modeler, but is
quite easy in SAS Enterprise Guide.
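For reference, the information value calculation itself is not complicated; what is cumbersome in Modeler is wiring it up. Here is a minimal sketch in Python, with hypothetical bin counts of responders and non-responders for one candidate predictor:

```python
from math import log

# Information value (IV) of a binned predictor, built from the
# Weight of Evidence (WoE) of each bin.
def information_value(bins):
    """bins: list of (responders, non_responders) counts per bin."""
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    iv = 0.0
    for good, bad in bins:
        pg = good / total_good   # share of responders in this bin
        pb = bad / total_bad     # share of non-responders in this bin
        woe = log(pg / pb)       # Weight of Evidence for the bin
        iv += (pg - pb) * woe
    return iv

# hypothetical bins, ordered from low to high values of the variable
bins = [(10, 40), (20, 30), (30, 20), (40, 10)]
print(round(information_value(bins), 3))  # → 0.913
```

Rules of thumb vary, but an IV above roughly 0.3 is usually read as a strong predictor.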
Half-Truths
What bothers me about this is the half-truths espoused by the
ad (and to me, half of a truth is the equivalent of a lie). You could read
the results of the survey and use them to justify purchasing a red
Mustang convertible just as easily. This is not to say that there are not
some fundamental issues with relying solely on R for your analytics. But
some of their “bold” conclusions are not accurate. For instance, “R is not
easy to learn for everyone. Not everyone is a programmer,” is true;
however, to do anything other than very basic modeling with SPSS
Modeler, you have to write scripts, and the scripting language is more
difficult than R’s programming language. Plus, the documentation on the
scripting language is almost non-existent.
Conclusion
Nevertheless, it will probably be a good webinar. It is just lame that you
have to tell half-truths to advertise it. Finally, if you want SPSS
Modeler to be better, include SPSS Statistics functionality in the modeling
blocks included in Modeler, and work on your documentation, a lot!
An Analytics Best Kept Secret
I have written about R. I have told you about Octave. Little did I know
that I had not yet found the motherlode. Thanks to my friend and colleague
Dan Pompea, I have now discovered SciLab.
If MATLAB has any advantage over this open-source gem, I have yet to
find it (at least for use in operations research and analytics). Not even
Simulink is worthy of mention in the same breath. I have used MATLAB
extensively. It was the foundation of my bestselling book, Missile Flight
Simulation. Now I must turn my back on this industry-standard
workhorse with its ferocious price tag for this newcomer and its
incredibly low cost: a donation, if you choose.
I have yet to explore its vast capabilities, but what I read is remarkable.
However, I have carefully exercised its modeling and simulation features,
and I say, "Farewell, poor Simulink, I knew you so well." I am so sorry for
your loss, dear Octave, for you were becoming a shining star. But now
there is a new kid in town, whose glimmer is brighter than your own.
To ExtendSim, I say, "I enjoyed our time together, and the role you
played in my second best-seller, Discrete Event Simulation using
ExtendSim." To Arena Simulation, which I introduced to Army
Operations Research ten years ago, I say, "Farewell, trusted companion. I
hope your future is bright."
What is Linear Programming?
Linear programming (LP) is a tool for solving optimization problems. In
1947, George Dantzig developed an efficient method, the simplex
algorithm, for solving linear programming problems (LPs).
Since the development of the simplex algorithm, LP has been used to
solve optimization problems in industries as diverse as banking,
education, forestry, petroleum, and trucking. In a survey of Fortune 500
firms, 85% of the respondents said they had used linear programming.
As a measure of the importance of linear programming in operations
research, approximately 70% of this book will be devoted to linear
programming and related optimization techniques.
Step 1. Find the decision variables, i.e., find out what the variables are
whose values you can choose.
Step 2. Find the objective function, i.e., find out how the objective to
be minimized or maximized depends on the decision variables.
Step 3. Find the constraints, i.e., find out the (in)equalities that the
decision variables must satisfy. (Don't forget the possible sign
constraints!)
This is not the best example in the world, because we can solve it
without a linear program, but it makes the concept simple to
illustrate. Let's say we want to maximize profit on a certain
product. Profit would be our objective. In simplest terms, profit
is revenue minus costs, so we could write the objective
function as
max P = R - C
s.t.:
We would then code this into a software program like Excel, LINDO,
MATLAB, etc. and let the software solve the LP. Though there are many
other technical details associated with arriving at a solution, this is
basically what linear programming is all about. We could make the
problem more interesting by allowing the number of items produced to
vary between zero and an upper bound of 250.
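To make the "let the software solve the LP" step concrete, here is a toy two-product LP (the profits and capacities are invented for illustration) solved in Python by brute-force vertex enumeration. A real solver uses the simplex algorithm instead, but the principle is the same: the optimum of an LP lies at a vertex of the feasible region.

```python
from itertools import combinations

# Hypothetical two-product LP (all numbers invented):
#   maximize  P = 3*x1 + 5*x2         profit per unit of each product
#   s.t.          x1          <= 4    capacity of line 1
#                 2*x2        <= 12   capacity of line 2
#                 3*x1 + 2*x2 <= 18   shared labor hours
#                 x1, x2      >= 0
# Each constraint is stored as (a, b, rhs) meaning a*x1 + b*x2 <= rhs;
# the sign constraints are written as -x1 <= 0 and -x2 <= 0.
cons = [(1, 0, 4), (0, 2, 12), (3, 2, 18), (-1, 0, 0), (0, -1, 0)]

def feasible(x1, x2, eps=1e-9):
    return all(a * x1 + b * x2 <= rhs + eps for a, b, rhs in cons)

# Intersect every pair of constraint boundary lines; each feasible
# intersection is a vertex of the feasible region.
best = None
for (a1, b1, r1), (a2, b2, r2) in combinations(cons, 2):
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-12:
        continue                      # parallel boundaries: no vertex
    x1 = (r1 * b2 - r2 * b1) / det    # Cramer's rule for the 2x2 system
    x2 = (a1 * r2 - a2 * r1) / det
    if feasible(x1, x2):
        profit = 3 * x1 + 5 * x2
        if best is None or profit > best[0]:
            best = (profit, x1, x2)

print(best)  # → (36.0, 2.0, 6.0): profit 36 at x1 = 2, x2 = 6
```

Enumerating vertices blows up quickly as constraints are added, which is exactly why the simplex algorithm (which walks from vertex to better vertex) matters.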
MATLAB Code
This is an LP with six variables and seven constraints. The code consists of
defining three matrices: A, b, and c. The line at the bottom executes the
LP. It happens to be a minimization LP. glpk is the function that solves
the LP, the zeros are lower bounds on the variables, "L" indicates the
constraints have lower bounds, "C" indicates the variables are
continuous, and "1" tells the program to minimize. The c matrix
represents the objective function, while A and b constitute the
constraints. This also runs in an open-source program called Octave.
c = [-1,  0,  0,  0,  0,  1];
A = [-1,  0,  1,  0,  0,  0;
     -1,  1,  0,  0,  0,  0;
      0,  0, -1,  0,  1,  0;
      0,  0, -1,  1,  0,  0;
      0,  0,  0, -1,  1,  0;
      0,  0,  0,  0, -1,  1;
      0, -1,  1,  0,  0,  0];
b = [6; 9; 8; 7; 10; 12; 0];
[x_min, z_min, status, extra] = glpk(c, A, b, [0; 0; 0; 0; 0; 0], [], "LLLLLLL", "CCCCCC", 1)
Conclusion
LPs are used for optimizing schedules, networks, transportation
systems, delivery systems, manufacturing, investing, and so on.
These often comprise many variables and constraints.
LPs can be used to optimize a project schedule in project
management. Operations research analysts are specifically trained
to formulate and solve these kinds of problems.
Part IX - Advice
1. When dealing with “big data” (#bigdata), about 2/3 of your project
time is spent getting access to the data, getting the right data,
preprocessing the data, and doing exploratory data analysis, prior to
model building.
2. Presentation of the results is as important as the analytics
performed. If you cannot convey the results clearly and succinctly,
you’re not adding value for your customer. In particular, your results
must be translated into economic value or a similar metric.
3. Everything benefits from peer reviews, including the 2/3 work
performed before modeling (See #1). You should develop a template
for conducting peer reviews so that they are thorough and
consistent.
4. You should attempt to validate everything. This includes any
requirements, data provided (response data, etc.), macros provided,
etc. That is to say, not just your model.
5. Listen carefully to the customer and help them express their “real”
problem. Often the customer knows they have a problem but either
cannot express it or they do not identify the root problem.
6. Ensure that the variables in a model are intuitive and limit them to
the most predictive variables. Rule of thumb: no more than ten
variables.
7. Document your work as you progress. Either keep a notebook for
your project or record it electronically. I use a spreadsheet with a tab
for each task or sub-task. This may be used later for writing a
development document or providing information for validation.
About the picture: The lessons learned on the picture are seven of ten
staff rules I learned from the executive officer when I was a personnel
officer in 3rd Squadron 2nd Armored Cavalry Regiment. Note: Charlie
Beckwith commanded Desert One during Operation Eagle Claw (Google
it).
Here’s one more thing…advice
for young people
In this article I want to add to "Here’s the thing…more advice for young
analysts", based on questions I continue to receive through LinkedIn
messages and e-mail. I also want to extend this beyond advice to
analysts, and include anyone who may have an ear to hear.
Here’s the thing…Not everyone can write. If you are one of these, then
read and comment on other people's articles/blogs. Better yet,
comment and share both on LinkedIn and Twitter. Every LinkedIn post
you read has an icon (under the title) for LinkedIn, Twitter, Google+ and
Facebook. Sharing helps the author get more exposure, and it helps the
person who shares. Reading more may also make you a better writer.
Here’s the thing…There are two ways to do things: right and over. Even
the best of us make mistakes. I can make a mistake in the blink of an eye,
even though I have been at my profession for over 20 years. So when I
do not get it right, I own up to my mistake and do it (whatever task I was
trying to perform) OVER! I suppose that leads to some advice for not so
young people: allow your young employees to make mistakes and have
them do it over. This leads to learning and better workers.
Here’s the thing…I have to repeat this one: some jobs require a Ph.D.,
like a professor at an accredited university. (You can teach without a
Ph.D., just not as a professor—I know a really great guy who has been
doing this for years.) Most jobs do not require a Ph.D. However, some
people feel like they will never be of much worth without it. And here is
the crux of the matter:
“If you are not good enough without it, you will never be good enough
with it.”
Here’s the thing…Some of the very best Data Scientists and Analytic
Professionals that I know do not have Ph.D.’s, and hardly anyone
addresses me as Doctor (usually it’s just “Hey you!”).
Here’s the thing…This one is also worth repeating: there are many
online courses and tutorials that can provide you with the training you
need without getting another degree. For young analysts, I refer you to
another bicorner.com article I wrote recently: Online Education in
Analytics and Data Science. In general, Coursera offers many courses
with flexible scheduling at no cost. The key here is to become a "lifelong
learner", constantly seeking self-improvement. Oh, and it shows a little
initiative, not waiting on the company to train you. Most companies do
not offer much training anyway.
Here’s the thing…Know who you are competing with and (here's the
shocker) help them! This can backfire, but a good supervisor will take
note of the one who is helping their colleagues and might say, "I see
leadership potential there." There are two ways of doing this (and they
are not "right" and "over"): with fanfare and without it. One way to
do this openly is to search for people on LinkedIn who have the “title”
or “professional headline” that you are interested in. Study profiles to
see what these people are doing and have done, and connect with them.
Oh, and I mean CONNECT with them, not just add someone to your
network. In fact, go back to my second "Here's the thing..." and share
what other people are doing. At work, I would do it with less fanfare:
not that I would hide it, but I wouldn't go to extremes in making sure my
help was noticed.
Index
@
@Risk .......................................................................................................... 17, 44
“
“what if” analysis .............................................................................................. 61
A
acquisition ............................................................... 14, 18, 50, 54, 128, 134, 160
ad hoc procedures ...................................................................................... 10, 28
Alan Turing ...................................................................................................... 123
All Data ...................................................................................................... 97, 101
alternative hypothesis ...................................................................................... 91
Analytical Customer Relationship Management .................................... 9, 14, 21
analytical solution ............................................................................................. 11
Analytics .. 3, 7, 8, 9, 11, 14, 15, 16, 17, 18, 28, 29, 31, 43, 44, 51, 52, 56, 74, 76,
82, 83, 90, 91, 92, 93, 94, 95, 107, 113, 114, 131, 132, 133, 134, 142, 150,
155, 157, 159, 161, 162, 166, 167, 173, 175
ANN ..................................................................................... 40, 42, 66, 67, 68, 69
attrition ................................................................................... 15, 39, 40, 42, 130
auto-neural network ............................................................................... See ANN
B
Big Data ..........................................................2, 3, 4, 5, 18, 36, 97, 102, 109, 130
BOARD....................................................................................... 44, 161, 162, 163
bottlenecks ....................................................................................................... 83
Business analytics ............................................................................................... 7
business case .................................................................. 26, 29, 32, 34, 146, 147
business rules ................................................................................................ 9, 13
C
call center .................................... 35, 56, 57, 63, 71, 72, 129, 134, 135, 136, 147
campaign ...............................................................................................49, 50, 56
Capital asset pricing model .............................................................................. 17
channels ................................................................................................16, 54, 55
Clarity Solution Group .................................................................................... 173
classification trees .......................................................................................40, 42
Clinical Decision Support Systems .................................................................... 14
clustering ......................................................................... 29, 57, 58, 59, 101, 162
Colorado Trail ........................................................................................82, 83, 84
control group...............................................................................................49, 50
Coursera ............................................................................ 29, 102, 107, 114, 176
cross-sale .......................................................................................................... 15
cross-sell ........................................................................................................... 14
customer base .............................................................................................14, 54
Customer Lifetime Value .................................................................................. 54
customers .. 8, 9, 12, 13, 14, 15, 16, 18, 26, 29, 33, 34, 45, 47, 48, 50, 53, 54, 55,
56, 57, 59, 72, 76, 77, 82, 86, 87, 88, 89, 113, 125, 126, 129, 134, 135, 147,
161, 173, 174
D
data mining ..................................................................... 114, 115, 142, 150, 153
data patterns .................................................................................................8, 12
data reduction ................................................................................... 4, 30, 33, 95
Data Science ii, 10, 11, 29, 36, 37, 42, 94, 95, 102, 103, 104, 105, 108, 114, 116,
118, 164, 176
data scientist .... 36, 104, 105, 106, 107, 108, 111, 112, 113, 114, 116, 117, 118,
119, 122, 123, 124, 125
decision making......................................................................... 7, 10, 12, 14, 132
decision models................................................................................................ 13
decision trees ................................................................................................... 19
descriptive analytics ........................................................................................... 8
descriptive models .................................................................................. 8, 13, 51
deterministic ............................................................................................... 64, 65
deterministic model .......................................................................................... 65
direct marketing.......................................................................................... 14, 16
discrete event simulation ........................... 43, 44, 61, 62, 71, 72, 134, 135, 168
Distributed Interactive Simulation .................................................................... 63
E
explanatory variables .................................................................................. 11, 12
ExtendSim ................................................................................................. 44, 168
F
Facebook ......................................................................................................... 175
false positive ............................................................................................... 84, 91
File of Dreams ................................................................................................... 45
Financial Services and Insurance ...............................................................See FSI
forecasting model ........................................................................... 12, 38, 40, 42
fraud...................................................................................... 8, 11, 12, 16, 17, 92
FSI............................................................................................ 134, 159, 160, 173
funnel ................................................................................................................ 47
G
game theory ...................................................................................... 43, 137, 138
general equilibrium theory ..................................................................... 137, 138
George Box........................................................................................................ 24
Google+ ........................................................................................................... 175
H
Hadoop.................................................................................. 18, 30, 44, 115, 129
heuristics ......................................................................................... 11, 17, 25, 92
Hierarchical Optimal Discriminant Analysis ...................................................... 19
High Level Architecture ..................................................................................... 63
Human Resources ................................................................... 130, 131, 132, 133
I
inferential statistics .......................................................................................... 90
Informatics ..................................................................................................... 111
Information Value ............................................................................................ 98
INFORMS ...................................................................................................52, 119
integration .......................................................................................112, 117, 118
Internet of Things ........................................................................................... 117
J
jargon ......................................................................................................126, 128
Jessica Rabbit ................................................................................................... 86
John Nash ....................................................................................................... 139
K
Key Performance Parameter .......................................................................... 147
K-Nearest Neighbor .......................................................................................... 19
Kurt Vonnegut .....................................................................................75, 88, 131
L
likelihood ........................................................... 8, 12, 13, 15, 45, 54, 55, 56, 128
linear programming.............................................................................29, 44, 169
linear regression ............................................................................................... 19
LinkedIn ............................................................................... 37, 55, 165, 175, 176
logic regression ................................................................................................ 19
logistic regression .......................................... 11, 19, 29, 38, 65, 69, 74, 128, 160
M
machine learning ..... 8, 10, 11, 12, 17, 18, 24, 29, 35, 38, 51, 66, 67, 91, 92, 105,
108
maintainability ....................................................................................62, 72, 135
MapReduce ...................................................................................................... 18
marketing ...... 8, 9, 10, 12, 14, 32, 42, 49, 50, 54, 56, 92, 95, 107, 128, 129, 134,
145, 151, 159, 171
mathematical modeling ..............................................................................29, 44
MATLAB ...................................................................................... 43, 44, 167, 171
mean ................................................................................................................. 84
metric ............................................................................ 34, 50, 84, 146, 147, 173
model 8, 10, 11, 12, 13, 16, 19, 24, 25, 26, 27, 30, 31, 32, 33, 34, 35, 36, 37, 38,
40, 47, 48, 50, 54, 55, 56, 57, 60, 61, 62, 63, 69, 74, 98, 123, 125, 126, 128,
134, 136, 138, 147, 154, 156, 157, 160, 165, 174
modeler judgment ................................................................................ 11, 12, 33
Monte Carlo ................................................................................................ 43, 61
Multivariate adaptive regression splines .......................................................... 19
mutual defection............................................................................................. 140
N
Naïve Bayes ....................................................................................................... 19
NASA ................................................................................................... 3, 4, 42, 88
Nash equilibrium ............................................................................................. 140
neural networks .................................................................. 19, 38, 40, 51, 67, 68
null hypothesis .................................................................................................. 91
O
open-source .............................................................. 44, 154, 159, 160, 166, 167
Operation Eagle Claw ................................................................................ 87, 174
Operation Overlord ........................................................... 37, 119, 145, 150, 152
operational research ...................................................... See operations research
operations research .. xvi, 7, 11, 24, 28, 31, 32, 34, 37, 42, 52, 75, 76, 77, 92, 94,
95, 107, 119, 120, 142, 143, 144, 145, 150, 151, 152, 159, 167, 168, 169
optimization ...............................12, 13, 29, 43, 92, 119, 143, 145, 150, 151, 169
Otave................................................................................................................. 44
P
Point of Sell ....................................................................................................... 56
predictive analytics 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 28, 44, 53, 56, 57, 73,
74, 92, 94, 118, 161
predictive modeling.................. 7, 8, 10, 12, 23, 28, 36, 37, 53, 91, 119, 159, 173
predictive models .................. 8, 10, 12, 13, 15, 24, 36, 38, 41, 49, 51, 52, 53, 80
prescriptive analytics ......................................................................................8, 9
Probabilistic Risk Assessment .......................................................................... 17
probit regression .............................................................................................. 19
problem solving .................................................................................75, 120, 121
problem space .........................................................................................151, 152
profit .................................................................................... 32, 34, 113, 169, 170
propensity model ............................................................................................. 45
propensity models............................................................................................ 53
propensity to buy ........................................................ 24, 35, 47, 55, 65, 98, 128
propensity to churn .......................................................................................... 56
propensity to engage ..................................................................................42, 54
propensity to unsubscribe................................................................................ 55
Python ......................................................................... 30, 97, 113, 124, 129, 157
Q
quantile regression........................................................................................... 19
R
R 154, 164
random forests ............................................................................................40, 42
regression .............................. 12, 18, 19, 29, 38, 42, 51, 69, 76, 80, 95, 101, 105
regression trees ................................................................................................ 19
reliability ....................................................................................... 10, 62, 72, 135
requirements ................................... 29, 32, 33, 83, 102, 108, 111, 118, 119, 174
retention .................................................................................................9, 14, 15
Return on Investment .................................................................................... 147
revenue ........................................................................................ 54, 55, 138, 170
Rexer Analytics ............................................................................................... 153
ridge regression ................................................................................................ 19
R-Studio ...............................................................................................29, 58, 155
S
SAS ................... 21, 29, 30, 97, 113, 115, 128, 129, 155, 156, 157, 159, 160, 165
SAS Enterprise Modeler .............................................................................. 21, 29
scientist .................. 23, 36, 37, 106, 108, 113, 116, 119, 121, 123, 124, 125, 164
Scilab ................................................................................................................. 44
SCILAB ..................................................................................................... 167, 168
segmentation .............................................................................................. 12, 57
Share of Wallet ................................................................................. 54, 146, 147
simulation ..... 3, 10, 29, 44, 60, 61, 62, 63, 64, 73, 74, 75, 76, 77, 118, 119, 121,
145, 150, 167, 173
SPSS Modeler ...................................................................... 21, 29, 113, 165, 166
SQL .................................................................................................... 30, 113, 154
statistical model ........................................................................ 11, 24, 25, 38, 60
statistical techniques ...................................................... 8, 10, 11, 12, 23, 29, 80
statistician ................................................. 11, 79, 92, 94, 95, 103, 105, 121, 123
Statistics .. 8, 10, 11, 23, 28, 29, 36, 37, 41, 78, 79, 80, 90, 91, 92, 93, 94, 97, 98,
100, 101, 102, 104, 111, 156, 157
Steve Cartwright ............................................................................................... 47
stochastic processes ................................................................................... 61, 64
Structured Query Language ..................................................................... See SQL
T
target ................................................................................................................ 16
The Prisoner’s Dilemma .................................................................................. 139
time-series model ............................................................................................. 38
training sample ................................................................................................. 12
treated group .................................................................................................... 49
Twitter............................................................................................................. 175
U
unstructured data ................................................................................. 9, 18, 130
Uplift model ................................................................................................ 49, 50
V
validation ..................................................................................... 20, 26, 141, 160
Vince Roske .................................................................................................... 146
W
waiting time ......................................................................................72, 135, 136
Weight of Evidence .......................................................................................... 98