Trend analysis means determining consistent movement in a certain direction. There are two types of trends: deterministic, where we can identify the underlying cause, and stochastic, where the movement is random and cannot be attributed to an identifiable cause.
Seasonal variation describes events that occur at specific and regular intervals during the course of
a year. Serial dependence occurs when data points close together in time tend to be related.
Time series analysis and forecasting models must define the types of data relevant to answering the
business question. Once analysts have chosen the relevant data they want to analyze, they choose
what types of analysis and techniques are the best fit.
Important Considerations for Time Series Analysis
While time series data is data collected over time, there are different types of data that describe how
and when that time data was recorded. For example:
Time series data is data that is recorded over consistent intervals of time.
Cross-sectional data consists of several variables recorded at the same time.
Pooled data is a combination of both time series data and cross-sectional data.
Time Series Analysis Models and Techniques
Just as there are many types and models, there are also a variety of methods to study data. Here are
the three most common.
Box-Jenkins ARIMA models: These univariate models are used to better understand a single time-
dependent variable, such as temperature over time, and to predict future data points of variables.
These models work on the assumption that the data is stationary. Analysts have to account for and
remove as many differences and seasonalities in past data points as they can. Thankfully, the
ARIMA model includes terms to account for moving averages, seasonal difference operators, and
autoregressive terms within the model.
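As an illustration, here is a minimal sketch of fitting a Box-Jenkins seasonal ARIMA model in R on the built-in AirPassengers series (monthly airline passenger counts, 1949-1960). The order (1,1,1)(0,1,1)[12] is chosen purely for illustration; in practice it would be selected from ACF/PACF plots or an automated search.
# Seasonal ARIMA sketch on the built-in AirPassengers series
data(AirPassengers)
fit <- arima(AirPassengers, order = c(1, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
fit                                 # estimated coefficients and AIC
fc <- predict(fit, n.ahead = 12)    # forecast the next 12 months
plot(AirPassengers, xlim = c(1949, 1962), ylab = "Passengers")
lines(fc$pred, col = "blue")        # overlay the forecasts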
Box-Jenkins Multivariate Models: Multivariate models are used to analyze more than one time-
dependent variable, such as temperature and humidity, over time.
Holt-Winters Method: The Holt-Winters method is an exponential smoothing technique. It is
designed to predict outcomes, provided that the data points include seasonality.
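A minimal Holt-Winters sketch on the same built-in AirPassengers series follows; the multiplicative seasonal option is an assumption that happens to suit this particular data.
hw_fit <- HoltWinters(AirPassengers, seasonal = "multiplicative")
hw_fit                                   # fitted smoothing parameters alpha, beta, gamma
hw_fc <- predict(hw_fit, n.ahead = 12)   # forecast the next 12 months
plot(hw_fit, hw_fc)                      # observed, fitted and forecast values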
Visualization: Finally, you can use various visualization techniques to explore the time series data
further. This can include plotting the data with different levels of smoothing or using techniques
such as time series forecasting to predict future values.
Overall, R provides a wide range of tools and techniques for exploring and analyzing time series
data, making it a powerful tool for time series analysis.
A time series in R shows how a quantity behaves over a period of time. In R, one can be created easily with the ts() function and a few parameters: it takes a data vector and associates each value with a timestamp as specified by the user. Time series objects are commonly used to study and forecast the behaviour of business quantities over time, for example a company's sales, inventory levels, the price of a particular stock or market index, or population figures.
Syntax:
objectName <- ts(data, start, end, frequency)
where,
data represents the data vector
start represents the time of the first observation in the time series
end represents the time of the last observation in the time series
frequency represents the number of observations per unit of time; for example, frequency = 12 for monthly data and frequency = 1 for annual data.
Note: To know about more optional parameters, use the following command in R console:
help("ts")
Example: Let us take the COVID-19 pandemic as an example, using the weekly total number of confirmed positive COVID-19 cases worldwide from 22 January 2020 to 15 April 2020 as the data vector.
# Weekly data of COVID-19 positive cases from
# 22 January, 2020 to 15 April, 2020
x <- c(580, 7813, 28266, 59287, 75700,
87820, 95314, 126214, 218843, 471497,
936851, 1508725, 2072113)
# library required for decimal_date() function
library(lubridate)
# output to be created as png file
png(file ="timeSeries.png")
# creating time series object
# from date 22 January, 2020
mts <- ts(x, start = decimal_date(ymd("2020-01-22")), frequency = 365.25 / 7)
# plotting the graph
plot(mts, xlab ="Weekly Data",
ylab ="Total Positive Cases",
main ="COVID-19 Pandemic",
col.main ="darkgreen")
# saving the file
dev.off()
GARCH (generalized autoregressive conditional heteroskedasticity) models capture the volatility clustering and persistence in financial data by allowing the variance of the series to be time-varying.
GARCH models build upon the autoregressive conditional heteroskedasticity (ARCH) framework,
which models the conditional variance of a series as a function of its past squared residuals. The
GARCH model extends this by including both past squared residuals and past conditional
variances in the conditional variance equation. This allows for the persistence of volatility to be
captured in the model.
There are several variations of GARCH models, including the original GARCH, the EGARCH
(Exponential GARCH), the TGARCH (Threshold GARCH), and the IGARCH (Integrated GARCH).
These models differ in how they model the conditional variance and incorporate additional features
such as asymmetry, leverage effects, and long-term memory.
GARCH models are estimated using maximum likelihood estimation, and model selection can be
done using criteria such as Akaike Information Criterion (AIC) or Bayesian Information Criterion
(BIC). Once a GARCH model has been estimated, it can be used for forecasting future volatility and
assessing risk.
GARCH models have become popular in finance because they are able to capture the complex
dynamics of financial data and provide accurate estimates of future volatility. However, they can be
computationally intensive to estimate and may require large amounts of data to achieve accurate
results. Additionally, GARCH models are limited by their assumption of normality in the
distribution of residuals, which may not be appropriate for all types of financial data.
Example
# Load the necessary package
library(rugarch)
# Load the data: 'msft' is assumed here to be a data frame of daily Microsoft
# closing prices read from a CSV file; it is not a dataset shipped with any package
msft <- read.csv("msft.csv")
# Estimate a GARCH(1,1) model
spec <- ugarchspec(variance.model = list(model = "sGARCH", garchOrder = c(1, 1)),
                   mean.model = list(armaOrder = c(0, 0)))
fit <- ugarchfit(spec, data = msft$Close)
# Print the model summary (printing the fitted object shows the summary below)
fit
# Make a forecast for the next 5 periods
forecast <- ugarchforecast(fit, n.ahead = 5)
# Plot the forecast (plot() on the forecast object offers a menu of plots)
plot(forecast)
In this example, msft is assumed to be a data frame containing daily closing prices of Microsoft stock (for instance, from January 1, 2005 to December 31, 2010) read in from a CSV file; no such dataset ships with the tseries or rugarch packages. We then specify a GARCH(1,1) model using the ugarchspec function from the rugarch package. The variance.model argument specifies a standard GARCH model (model = "sGARCH") with one ARCH term and one GARCH term (garchOrder = c(1,1)). The mean.model argument specifies a constant mean with no ARMA terms (armaOrder = c(0,0)).
We then fit the model to the data using the ugarchfit function and print the model summary. The
summary displays the estimated parameters of the model, as well as diagnostic statistics such as the
log-likelihood and the AIC.
Next, we use the ugarchforecast function to make a forecast for the next 5 periods. Finally, we plot
the forecasted volatility using the plot function.
*---------------------------------*
* GARCH Model Fit *
*---------------------------------*
Conditional Variance Dynamics
-----------------------------------
GARCH Model : sGARCH(1,1)
Mean Model : ARFIMA(0,0,0)
Distribution : norm
Optimal Parameters
------------------------------------
Estimate Std. Error t value Pr(>|t|)
mu 27.60041 0.042660 647.008 0.00000
omega 0.06801 0.003504 19.419 0.00000
alpha1 0.10222 0.010072 10.153 0.00000
beta1 0.88577 0.006254 141.543 0.00000
Robust Standard Errors:
Estimate Std. Error t value Pr(>|t|)
mu 27.60041 0.031563 874.836 0.00000
omega 0.06801 0.004074 16.691 0.00000
alpha1 0.10222 0.010039 10.183 0.00000
beta1 0.88577 0.008246 107.388 0.00000
LogLikelihood : 2321.214
Information Criteria
------------------------------------
Akaike -13.979
Bayes -13.952
Shibata -13.979
Hannan-Quinn -13.969
Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
statistic p-value
Lag[1] 0.525 0.468736
Lag[2*(p+q)+(p+q)-1][4] 1.705 0.074600
Lag[4*(p+q)+(p+q)-1][8] 2.396 0.001582
d.o.f=0
H0 : No serial correlation
Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
statistic p-value
Lag[1] 0.007 0.933987
Lag[2*(p+q)+(p+q)-1][4] 2.710 0.244550
data("economics")
data_ts <- ts(economics[, c("unemploy", "pop")], start = c(1967, 1), frequency = 12)
Before estimating the VAR model, let's plot the data to visualize any trends or seasonality.
autoplot(data_ts) +
labs(title = "Unemployment Rate and Total Population in the US",
x = "Year",
y = "Value")
From the plot, we can see that both variables have a clear upward trend, so we will need to take
first differences to make them stationary.
data_ts_diff <- diff(data_ts)
Now that our data is stationary, we can proceed to estimate the VAR model.
Estimate the VAR model:
We will estimate a VAR model with 2 lags and a constant term using the VAR() function from the
vars library.
var_model <- VAR(data_ts_diff, p = 2, type = "const")
We can use the summary() function to view a summary of the estimated VAR model.
summary(var_model)
The output will show the estimated coefficients, standard errors, t-values, and p-values for each lag
of each variable in the VAR model.
Evaluate the VAR model:
To evaluate the VAR model, we can perform a serial correlation test and an ARCH test using the
serial.test() and arch.test() functions from the vars library, respectively.
serial.test(var_model)
arch.test(var_model)
If the p-value of the serial correlation test is less than 0.05, it indicates the presence of serial
correlation in the residuals, which violates the assumptions of the VAR model. Similarly, if the p-
value of the ARCH test is less than 0.05, it indicates the presence of conditional heteroskedasticity
in the residuals.
Forecast using the VAR model:
To forecast future values of the variables, we can use the predict() function from the vars library.
var_forecast <- predict(var_model, n.ahead = 12)
This will generate a forecast for the next 12 months, based on the estimated VAR model. We can
visualize the forecast using the autoplot() function from the ggfortify library.
autoplot(var_forecast) +
labs(title = "Forecast of Unemployment Rate and Total Population in the US",
x = "Year",
y = "Value")
This plot shows the forecasted values for the next 12 months for both variables in the VAR model.
We can see that the unemployment rate is expected to decrease slightly over the next year, while
the total population is expected to continue increasing.
Summary
Business forecasting using time series involves using statistical methods to analyze historical data
and make predictions about future trends in business variables such as sales, revenue, and demand
for products or services. Time series analysis involves analyzing the pattern of the data over time,
including identifying trends, seasonal patterns, and cyclical fluctuations.
One popular approach to time series forecasting is the use of ARIMA (autoregressive integrated
moving average) models, which can capture trends and seasonal patterns in the data, as well as the
autocorrelation of the series. Another popular approach is the use of VAR (vector autoregression)
models, which can capture the interdependencies between multiple time series variables.
Business forecasting using time series can be used for a variety of purposes, such as predicting sales
for a specific product or service, forecasting future demand for inventory, and predicting overall
market trends. Accurate forecasting can help businesses make informed decisions about resource
allocation, inventory management, and overall business strategy.
To be effective, time series forecasting requires high-quality data, including historical data and
relevant external factors such as changes in the economy, weather patterns, or industry trends. In
addition, it's important to validate and test the accuracy of the forecasting models using historical
data before applying them to future predictions.
Overall, business forecasting using time series analysis can be a valuable tool for businesses looking
to make data-driven decisions and stay ahead of market trends.
Keywords
Time series: A collection of observations measured over time, typically at regular intervals.
Trend: A gradual, long-term change in the level of a time series.
Seasonality: A pattern of regular fluctuations in a time series that repeat at fixed intervals.
Stationarity: A property of a time series where the mean, variance, and autocorrelation structure
are constant over time.
Autocorrelation: The correlation between a time series and its own past values.
White noise: A type of time series where the observations are uncorrelated and have constant
variance.
ARIMA: A statistical model for time series data that includes autoregressive, differencing, and
moving average components.
Exponential smoothing: A family of time series forecasting models that use weighted averages of
past observations, with weights that decay exponentially over time.
Seasonal decomposition: A method of breaking down a time series into trend, seasonal, and
residual components.
Forecasting: The process of predicting future values of a time series based on past observations and
statistical models.
Self Assessment
1. What is a time series?
A. A collection of observations measured over time
B. A statistical model for time series data
C. A method of breaking down a time series into trend, seasonal, and residual components
D. A type of time series where the observations are uncorrelated and have constant variance
5. What is ARIMA?
A. A statistical model for time series data that includes autoregressive, differencing, and
moving average components
B. A method of breaking down a time series into trend, seasonal, and residual components
C. A family of time series forecasting models that use weighted averages of past observations
D. A type of time series where the observations are uncorrelated and have constant variance
11. What R function can be used to calculate the autocorrelation function (ACF) and partial
autocorrelation function (PACF) for a time series?
A. acf()
B. pacf()
C. auto.correlation()
D. corr()
6. D 7. B 8. A 9. C 10. B
Review Questions
1. What is a time series? How is it different from a cross-sectional data set?
2. What are some common patterns that can be observed in time series data?
3. What is autocorrelation? How can it be measured for a time series?
4. What is stationarity? Why is it important for time series analysis?
5. What is the difference between the additive and multiplicative decomposition of a time series?
6. What is a moving average model? How is it different from an autoregressive model?
7. What is the difference between white noise and a random walk time series?
8. How can seasonal patterns be modeled in a time series?
9. What is the purpose of the ARIMA model? How is it different from the ARMA model?
10. What is the purpose of the forecast package in R? What are some functions in this package that
can be used for time series analysis?
11. What is the purpose of cross-validation in time series analysis? How can it be implemented in
R?
12. What are some techniques for time series forecasting? How can they be implemented in R?
Further Reading
"Forecasting: Principles and Practice" by Rob J Hyndman and George Athanasopoulos -
This is an online textbook that covers the basics of time series forecasting using R. It
includes a lot of examples and code for different time series models, as well as practical
advice on how to apply them.
"Time Series Analysis and Its Applications: With R Examples" by Robert H. Shumway
and David S. Stoffer - This is a textbook that covers time series analysis and forecasting
using R. It covers a wide range of topics, from basic time series concepts to advanced
models and methods.
"Time Series Analysis in R" by A. Ian McLeod - This is a short book that covers the basics
of time series analysis in R. It includes examples of using R to analyze and model time
series data, as well as information on visualizing and interpreting time series plots.
Objective
After studying this unit, students will be able to
learn about the theory behind GLMs, including the selection of appropriate link functions
and the interpretation of model coefficients.
gain practical experience in data analysis by working with real-world datasets and using
statistical software to fit GLM models and make predictions.
interpret the results of GLM analyses and communicate findings to others using clear and concise language.
think critically and solve complex problems, which can help develop important skills for
future academic and professional endeavors.
Introduction
Business prediction using generalized linear models (GLMs) is a common technique in data
analysis. GLMs extend the linear regression model to handle non-normal response variables by
using a link function to map the response variable to a linear predictor.
In business prediction, GLMs can be used to model the relationship between a response variable
and one or more predictor variables. The response variable could be a continuous variable, such as
sales or revenue, or a binary variable, such as whether a customer will make a purchase or not. The
predictor variables could be various business metrics, such as marketing spend, website traffic,
customer demographics, and more.
Generalized linear models (GLMs) can be used for business prediction in a variety of applications
such as marketing, finance, and operations. GLMs are a flexible statistical modeling framework that
can be used to analyze and make predictions about data that have non-normal distributions, such
as counts, proportions, and binary outcomes.
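As a hedged sketch (the model formulas and the data frame df below are placeholders, not real datasets), the family argument of R's glm() function is how these different outcome types are accommodated:
glm(y_binary ~ x1 + x2, data = df, family = binomial)             # binary outcome (logistic regression)
glm(y_count ~ x1 + x2, data = df, family = poisson)               # count outcome (Poisson regression)
glm(y_prop ~ x1 + x2, data = df, family = binomial, weights = n)  # proportions out of n trials
glm(y_pos ~ x1 + x2, data = df, family = Gamma(link = "log"))     # positive, skewed continuous outcome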
One common application of GLMs in business is to model customer behavior in marketing. For
example, a company might use GLMs to predict the likelihood of a customer responding to a
promotional offer, based on their demographic and behavioral data. This can help the company
optimize their marketing campaigns by targeting customers who are most likely to respond to their
offers.
GLMs can also be used in finance to predict the likelihood of default on a loan, based on the
borrower's credit history and other relevant variables. This can help banks and other financial
institutions make more informed lending decisions and manage their risk exposure.
In operations, GLMs can be used to predict the probability of defects or quality issues in a
manufacturing process, based on variables such as raw materials, production techniques, and
environmental factors. This can help companies optimize their production processes and reduce
waste and defects.
Overall, GLMs are a powerful tool for business prediction, providing a flexible and interpretable
framework for modeling a wide range of data types and outcomes.
Overall, linear regression is a useful and widely used technique in data analysis and prediction,
with many applications in business, economics, social sciences, and other fields.
A GLM models the relationship between a response variable and one or more predictor variables. The coefficients of the GLM can be used to determine the direction and magnitude of the effects of the predictor variables on the response variable.
In addition to logistic regression and Poisson regression, which are commonly used GLMs, other
types of GLMs include negative binomial regression, which can handle overdispersion in count
data, and ordinal regression, which can handle ordinal data.
GLMs require some assumptions, such as linearity between the predictor variables and the
response variable and independence of observations. Violations of these assumptions can lead to
biased or unreliable results.
Overall, GLMs are a useful and versatile statistical framework for modeling non-normally
distributed data, and can be applied in various fields. GLMs allow for the modeling of multiple
predictor variables and can be used for prediction and inference.
Suppose we want to model the relationship between a binary response variable (e.g., whether a
customer made a purchase or not) and several predictor variables (e.g., age, gender, income). We
can use logistic regression, a type of GLM, to model this relationship.
Data Preparation: First, we need to prepare our data. We will use a dataset containing information
on customers, including their age, gender, income, and whether they made a purchase or not. We
will split the data into training and testing sets, with the training set used to fit the model and the
testing set used to evaluate the model's performance.
Model Specification: Next, we need to specify the model. We will use logistic regression as our
GLM, with the binary response variable (purchase or not) modeled as a function of the predictor
variables (age, gender, income).
Model Fitting: We can fit the model to the training data using maximum likelihood estimation. The coefficients of the logistic regression model can be interpreted as changes in the log-odds of making a purchase associated with one-unit changes in the predictor variables.
Model Evaluation: We can evaluate the performance of the model using the testing data. We can
calculate metrics such as accuracy, precision, and recall to measure how well the model is able to
predict the outcome variable based on the predictor variables.
Model Improvement: If the model performance is not satisfactory, we can consider improving the
model by adding or removing predictor variables or transforming the data using different link
functions.
Overall, building a GLM model involves data preparation, model specification, model fitting,
model evaluation, and model improvement. By following these steps, we can build a model that
accurately captures the relationship between the response variable and the predictor variables and
can be used to make predictions or draw inferences.
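The steps above can be sketched in R as follows; the customers data frame and its columns (purchase, age, gender, income) are hypothetical placeholders, not a dataset shipped with R.
set.seed(42)
train_idx <- sample(seq_len(nrow(customers)), size = 0.7 * nrow(customers))
train <- customers[train_idx, ]
test <- customers[-train_idx, ]
# Specify and fit the logistic regression (a GLM with a logit link)
fit <- glm(purchase ~ age + gender + income, data = train, family = binomial)
summary(fit)                                        # coefficients on the log-odds scale
# Evaluate on the held-out test set
pred_prob <- predict(fit, newdata = test, type = "response")
pred_class <- ifelse(pred_prob > 0.5, 1, 0)
mean(pred_class == test$purchase)                   # simple accuracy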
The fitted model can also be used to test hypotheses about the effect of the predictor variables on the response variable, such as whether the effect is significant or not.
Overall, logistic regression is a useful and widely-used statistical technique for modeling binary
response variables and can be applied in various fields. It allows for the modeling of the
relationship between the response variable and multiple predictor variables, and provides
interpretable coefficients that can be used to draw conclusions about the effects of the predictor
variables on the response variable.
Example
Here is an example of using logistic regression to model a binary response variable:
Suppose we want to model the probability of a customer making a purchase based on their age and
gender. We have a dataset containing information on several customers, including their age,
gender, and whether they made a purchase or not.
Data Preparation: First, we need to prepare our data. We will split the data into training and testing
sets, with the training set used to fit the model and the testing set used to evaluate the model's
performance. We will also preprocess the data by encoding the gender variable as a binary
indicator variable (e.g., 1 for female and 0 for male).
Model Specification: Next, we need to specify the logistic regression model. We will model the
probability of making a purchase as a function of age and gender. The logistic regression model
takes the form:
log(p / (1 - p)) = β0 + β1(age) + β2(gender)
where p is the probability of making a purchase, β0 is the intercept term, β1 is the coefficient for age, and β2 is the coefficient for gender.
Model Fitting: We can fit the logistic regression model to the training data using maximum
likelihood estimation. The coefficients of the model can be interpreted as the change in log odds of
making a purchase for a one-unit increase in age or a change in gender from male to female.
Model Evaluation: We can evaluate the performance of the logistic regression model using the
testing data. We can calculate metrics such as accuracy, precision, and recall to measure how well
the model is able to predict the outcome variable based on the predictor variables.
Model Improvement: If the model performance is not satisfactory, we can consider improving the
model by adding or removing predictor variables or transforming the data using different link
functions.
Overall, logistic regression is a useful technique for modeling binary response variables and can be
used in various fields. In this example, we used logistic regression to model the probability of a
customer making a purchase based on their age and gender. By following the steps of data
preparation, model specification, model fitting, model evaluation, and model improvement, we can
build a model that accurately captures the relationship between the response variable and the
predictor variables and can be used for prediction or inference.
Data Preparation: First, we need to import our data into R. Let's assume our data is in a CSV file
named "car_ownership.csv". We can use the read.csv function to import the data:
car_data<- read.csv("car_ownership.csv")
Model Specification: Next, we need to specify the logistic regression model. We will use the glm
function in R to fit the model:
car_model<- glm(own_car ~ age + income, data = car_data, family = "binomial")
In this code, "own_car" is the response variable (binary variable indicating whether or not the
individual owns a car), and "age" and "income" are the predictor variables. The family argument
specifies the type of GLM to be fitted, in this case a binomial family for binary data.
Model Fitting: The glm() call above fits the model by maximum likelihood. We can inspect the fitted model using the summary function in R:
summary(car_model)
This will output a summary of the model, including the estimated coefficients and their standard
errors, as well as goodness-of-fit measures such as the deviance and AIC.
Model Evaluation: We can evaluate the performance of our model by calculating the predicted
probabilities of car ownership for each individual in our dataset:
car_prob<- predict(car_model, type = "response")
This will output a vector of predicted probabilities, one for each individual in the dataset. We can
then compare these probabilities to the actual car ownership status to evaluate the accuracy of the
model.
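For example, a simple comparison might look like the following sketch, assuming own_car is coded 0/1 in car_data:
pred_own <- ifelse(car_prob > 0.5, 1, 0)                  # classify using a 0.5 cut-off
table(Predicted = pred_own, Actual = car_data$own_car)    # confusion table
mean(pred_own == car_data$own_car)                        # overall accuracy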
Model Improvement: If the model performance is not satisfactory, we can consider improving the
model by adding or removing predictor variables, transforming the data, or using a different link
function.
Overall, performing logistic regression in R involves importing the data, specifying and fitting the model using the glm function, inspecting the fit with the summary function, evaluating the model by calculating predicted probabilities, and improving the model if necessary.
Example 2
Here's an example of how to perform logistic regression in R using the built-in "mtcars" dataset:
Data Preparation: We can load the "mtcars" dataset. The response variable "am" (which indicates whether a car has an automatic or manual transmission) is already coded as a binary indicator (0 for automatic and 1 for manual), so no recoding is needed:
data(mtcars)
Model Specification: We can specify the logistic regression model using the "glm()" function:
model <- glm(am ~ hp + wt, data = mtcars, family = binomial)
In this model, we are predicting the probability of a car having a manual transmission based on its
horsepower ("hp") and weight ("wt").
Model Fitting: The glm() call above fits the logistic regression model. We can use the "summary()" function to view the estimated coefficients and their significance:
summary(model)
This will output the estimated coefficients for the intercept, "hp", and "wt", as well as their standard
errors, z-values, and p-values.
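Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret; for instance:
exp(coef(model))      # odds ratios: multiplicative change in the odds of a manual
                      # transmission for a one-unit increase in hp or wt
exp(confint(model))   # corresponding profile-likelihood 95% confidence intervals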
Model Evaluation: We can evaluate the performance of the logistic regression model using the "predict()" function to obtain predicted probabilities (here, for simplicity, on the same data used for fitting, since no separate test set was created), and then calculate metrics such as accuracy to measure how well the model is able to predict the outcome variable based on the predictor variables.
probs <- predict(model, newdata = mtcars, type = "response")
preds <- ifelse(probs > 0.5, 1, 0)
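To finish the evaluation, a small sketch follows (note that these are in-sample figures, since we predicted on the data used for fitting):
table(Predicted = preds, Actual = mtcars$am)   # confusion table
mean(preds == mtcars$am)                       # overall (in-sample) accuracy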
Kaplan-Meier method
Cox Proportional hazard model
Kaplan-Meier Method
The Kaplan-Meier method estimates the survival distribution from truncated or censored data using the Kaplan-Meier estimator. It is a non-parametric statistic, so it lets us estimate the survival function without assuming an underlying probability distribution. The Kaplan–Meier estimate is based on the number of patients (each patient is a row of data), out of the total number at risk, who survive for a certain time after treatment (death being the event).
We represent the Kaplan–Meier estimator by the formula:
S(t) = ∏ (1 - di / ni), where the product is taken over all event times ti ≤ t.
Here S(t) represents the probability that survival time is longer than t; ti is a time at which at least one event happened; di is the number of events (e.g., deaths) that happened at time ti; and ni is the number of individuals known to be at risk (neither dead nor censored) just before time ti.
Example:
We will use the survival package for the analysis, together with the lung dataset that comes preloaded with it. The dataset contains records of 228 patients with advanced lung cancer from the North Central Cancer Treatment Group, described by 10 features. It contains missing values, so missing-value treatment is presumed to have been done before building the model.
# Installing package
install.packages("survival")
# Loading package
library(survival)
# Dataset information
?lung
# Fitting the survival model
Survival_Function <- survfit(Surv(time, status == 2) ~ 1, data = lung)
Survival_Function
# Plotting the function
plot(Survival_Function)
Here, we are interested in "time" and "status", as they play an important role in the analysis. Time represents the survival time of each patient, and status records whether the patient died (the event) or was censored.
The Surv() function takes the time and status as input and creates a survival object, which serves as the input of the survfit() function. We pass ~ 1 to survfit() to tell the function to fit a single survival curve (an intercept-only model) based on the survival object.
survfit() creates the survival curve; printing the object shows the number of observations, the number of events (deaths), the median survival time, and its 95% confidence interval. The plot gives the following output:
Here, the x-axis shows the number of days and the y-axis shows the probability of survival. The dashed lines are the upper and lower confidence limits.
The confidence interval shows the expected margin of error: for example, at 200 days of survival the upper confidence limit is about 0.76 (76%) and the lower limit about 0.60 (60%).
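If the survminer package is available, a richer plot of the same curve, with a shaded confidence band and a risk table, can be produced. This is an optional sketch, not part of the original example:
# install.packages("survminer")   # if not already installed
library(survminer)
ggsurvplot(Survival_Function, data = lung,
           conf.int = TRUE,      # shaded confidence band
           risk.table = TRUE,    # numbers at risk below the plot
           xlab = "Days", ylab = "Survival probability")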
Example:
We will again use the survival package and the preloaded lung dataset of 228 patients with advanced lung cancer from the North Central Cancer Treatment Group, described by 10 features. The dataset contains missing values, so missing-value treatment is presumed to have been done before building the model. We will be using the Cox proportional hazards function coxph() to build the model.
# Installing package
install.packages("survival")
# Loading package
library(survival)
# Dataset information
?lung
# Fitting the Cox model
Cox_mod <- coxph(Surv(time, status == 2) ~ ., data = lung)
# Summarizing the model
summary(Cox_mod)
# Fitting survfit()
Cox <- survfit(Cox_mod)
# Plotting the function
plot(Cox)
The Cox_mod output is similar to that of a regression model. Important features include age, sex, ph.ecog and wt.loss. The plot gives the following output:
Here, the x-axis specifies the number of days and the y-axis specifies the probability of survival. The dashed lines are the upper and lower confidence limits. Compared with the Kaplan-Meier plot, the Cox curve is higher for early times and lower for later times because the Cox model adjusts for additional covariates.
The confidence interval again shows the expected margin of error: for example, at 200 days of survival the upper confidence limit reaches about 0.82 (82%) and the lower limit about 0.70 (70%).
Note: The Cox model often gives more informative results than Kaplan-Meier because it adjusts for patient-level features. Here the Cox curve is higher at early times and drops more sharply as time increases.
Keywords
Response variable: The variable of interest that is being modeled and predicted by a GLM. It can
be continuous, binary, count, or ordinal.
Predictor variable: The variable(s) used to explain the variation in the response variable. They can
be continuous, binary, or categorical.
Link function: A function used to relate the mean of the response variable to the linear predictor. It
can be used to transform the response variable to a different scale or to model the relationship
between the predictor and response variables.
Exponential family: A class of probability distributions that includes the Poisson, binomial,
gamma, and normal distributions, among others. GLMs are based on the exponential family of
distributions.
Maximum likelihood estimation: A method used to estimate the parameters of a GLM by finding
the values that maximize the likelihood of the observed data.
Goodness of fit: A measure of how well a GLM model fits the observed data. It can be evaluated
using deviance, residual analysis, and other methods.
Residual analysis: A method used to check the assumptions of a GLM model and identify potential
problems such as outliers and influential observations.
Model selection: A process of comparing different GLM models and selecting the best one based
on their fit and complexity, using AIC/BIC, likelihood ratio tests, and other methods.
Self Assessment
1. What package is the core package for survival analysis in R?
A. ggplot2
B. dplyr
C. survival
D. tidyr
4. Which package provides additional visualization functions for interpreting survival analysis
results?
A. ggplot2
B. dplyr
C. survival
D. survminer
5. Which function is used to fit a Cox proportional hazards regression model to the data?
A. coxph()
B. phreg()
C. survfit()
D. flexsurv()
B. GLMs cannot be used for non-linear relationships between the predictor and response
variables
C. The variance of the response variable can be non-constant
D. GLMs assume a linear relationship between the predictor and response variables
8. Which of the following is a method for selecting the best predictors in a GLM?
A. Principle Component Analysis (PCA)
B. Multiple Linear Regression
C. Stepwise Regression
D. Analysis of Variance (ANOVA)
13. What does the family argument in the glm() function specify?
A. The type of distribution for the response variable
B. The type of distribution for the predictor variable
C. The type of link function to use
D. The type of loss function to use
14. How can you check the goodness of fit of a GLM model in R?
A. Using the summary() function
B. Using the anova() function
C. Using the plot() function
D. Using the residuals() function
Answers for Self Assessment
1. C 2. A 3. C 4. D 5. A
6. C 7. C 8. C 9. B 10. A
Review Questions
1. A hospital wants to determine the factors that affect the length of stay for patients. What
type of GLM would be appropriate for this analysis?
2. A manufacturing company is interested in modeling the number of defective items
produced per day. What type of GLM would be appropriate for this analysis?
3. A bank is interested in predicting the probability of default for a loan applicant. What type
of GLM would be appropriate for this analysis?
4. A marketing company wants to model the number of clicks on an online advertisement.
What type of GLM would be appropriate for this analysis?
5. A sports team is interested in predicting the probability of winning a game based on the
number of goals scored. What type of GLM would be appropriate for this analysis?
6. A social scientist wants to model the number of criminal incidents per month in a city.
What type of GLM would be appropriate for this analysis?
7. What is survival analysis and what types of data is it typically used for?
8. What is a Kaplan-Meier survival curve, and how can it be used to visualize survival data?
9. What is the Cox proportional hazards regression model, and what types of data is it
appropriate for analyzing?
10. What is a hazard ratio, and how is it calculated in the context of the Cox proportional
hazards model?
11. How can the results of a Cox proportional hazards regression model be interpreted, and
what types of conclusions can be drawn from the analysis?
12. How can competing risks analysis be conducted in R, and what types of outcomes is it
appropriate for analyzing?
13. What are some common visualizations used in survival analysis, and how can they be
created using R functions?
14. What are some potential sources of bias in survival analysis, and how can they be
addressed or minimized?
Further Reading
The GLM section of UCLA's Institute for Digital Research and Education website
provides a detailed introduction to GLMs, along with examples and tutorials using
different statistical software packages such as R and Stata:
https://stats.idre.ucla.edu/r/dae/generalized-linear-models/
The CRAN Task View on "Distributions and Their Inference" provides a
comprehensive list of R packages related to GLMs, along with their descriptions
and links to documentation: https://cran.r-project.org/web/views/Distributions.html
"Generalized Linear Models" by P. McCullagh and J. A. Nelder: This is a classic
book on GLMs and provides a thorough treatment of the theory and applications of
GLMs.
"An Introduction to Generalized Linear Models" by A. Dobson: This book is a
concise introduction to GLMs and covers the key concepts and methods in a clear
and accessible way
Objective
After studying this unit, the student will be able to
Introduction
Machine learning is a field of artificial intelligence that has been rapidly growing in recent years,
and has already had a significant impact on many industries. At its core, machine learning involves
the development of algorithms and models that can learn patterns in data, and then use those
patterns to make predictions or decisions about new data. There are several different types of
machine learning, including supervised learning, unsupervised learning, and reinforcement
learning. Each of these types of machine learning has its own strengths and weaknesses, and is
suited to different types of problems.
One of the most important applications of machine learning is in the field of natural language
processing (NLP). NLP involves using machine learning to analyze and understand human
language, and is used in applications such as chatbots, voice assistants, and sentiment analysis.
NLP is also important in fields such as healthcare, where it can be used to extract useful
information from patient records and other medical documents.
Another important application of machine learning is in computer vision. This involves using
machine learning to analyze and interpret visual data and is used in applications such as image and
video recognition, facial recognition, and object detection. Computer vision is important in fields
such as self-driving cars, where it is used to help vehicles navigate and avoid obstacles.
Predictive modeling is another important application of machine learning. This involves using
machine learning to make predictions based on data, and is used in applications such as fraud
detection, stock market prediction, and customer churn prediction. Predictive modeling is
important in fields such as marketing, where it can be used to identify customers who are likely to
leave a company and take steps to retain them.
The potential for machine learning is enormous, and its applications are likely to continue to
expand in the coming years. One area where machine learning is likely to have a significant impact
is in healthcare. Machine learning can be used to analyze patient data and identify patterns that
could be used to diagnose and treat a wide range of diseases. Machine learning can also be used to
identify patients who are at high risk of developing certain conditions and take steps to prevent
those conditions from occurring.
Another area where machine learning is likely to have a significant impact is in education. Machine
learning can be used to analyze student data and identify patterns that could be used to improve
learning outcomes. Machine learning can also be used to personalize learning for individual
students, by adapting the pace and style of instruction to their individual needs.
In conclusion, machine learning is a rapidly growing field with many exciting applications. Its
ability to learn from data and make predictions or decisions based on that data has already had a
significant impact on many industries, and its potential for the future is enormous. As more data
becomes available and more powerful computing resources become available, machine learning is
likely to continue to grow in importance and have a significant impact on many aspects of our lives.
That is what AI and machine learning are doing for you. These technologies let advertisers modify
the way they market. AI is changing the e-commerce landscape in a significant way, giving
marketers the advantage of tailoring their marketing strategies while also saving businesses a lot of
money.
As a result of artificial intelligence, the retail industry has reduced overstock, improved shipping times, and cut returns. Current trends suggest that, in the future, machines will be able to supplement the weak spots in a workforce without businesses having to resort to mass firings.
The increasing use of artificial intelligence will likely continue to affect the advertising industry.
With machine learning, marketers will get a deeper understanding of their customers’ minds and
hearts and will easily create communications layouts tailored to each customer.
Financial analysis
Using machine learning algorithms, financial analytics can accomplish simple tasks, like estimating
business spending and calculating costs. The jobs of algorithmic traders and fraud detectors are
both challenging. For each of these scenarios, historical data is examined to forecast future results
as accurately as possible.
In many cases, a small set of data and a simple machine learning algorithm can be sufficient for
simple tasks like estimating a business’s expenses. It’s worthwhile to note that stock traders and
dealers rely heavily on ML to accurately predict market conditions before entering the market.
Organizations can control their overall costs and maximize profits with accurate and timely
projections. When combined with automation, user analytics will result in significant cost savings.
Cognitive services
Another important use of machine learning in businesses is secure and intuitive authentication
procedures through computer vision, image recognition, and natural language processing. What is
more, businesses can reach much wider audiences, as NLP makes services accessible across multiple geographic locations, languages, and ethnic groups.
Another example of cognitive services is automatic or self-checkouts. Because of machine learning,
we have upgraded retail experiences, with Amazon Go being the perfect example.
Image Classification: Given a set of labeled images, the algorithm learns to classify new images
into the correct category. For example, a cat vs dog image classification task.
Sentiment Analysis: Given a set of labeled text data, the algorithm learns to classify new text
data as positive, negative, or neutral sentiment.
Spam Detection: Given a set of labeled emails, the algorithm learns to classify new emails as
spam or not spam.
Language Translation: Given a set of labeled pairs of sentences in two different languages, the
algorithm learns to translate new sentences from one language to the other.
Fraud Detection: Given a set of labeled transactions, the algorithm learns to classify new
transactions as fraudulent or legitimate.
Handwriting Recognition: Given a set of labeled handwritten letters and digits, the algorithm
learns to recognize new handwritten letters and digits.
Speech Recognition: Given a set of labeled audio samples of speech, the algorithm learns to
transcribe new speech into text.
Recommendation Systems: Given a set of labeled user-item interactions, the algorithm learns to
recommend new items to users based on their preferences.
Training: The model is trained on the labeled dataset by minimizing a loss function that measures
the difference between the predicted output and the true output. The model learns to adjust its
parameters to improve its predictions.
Evaluation: The trained model is evaluated on a separate test set to measure its performance. This
is important to ensure that the model can generalize to new, unseen data.
Deployment: Once the model has been trained and evaluated, it can be deployed in the real
world to make predictions on new, unseen data.
xgboost: A package for building gradient boosting models, which are a type of ensemble learning
method.
keras: A package for building deep learning models using the Keras API.
nnet: A package for building neural network models using the backpropagation algorithm.
ranger: A package for building random forest models that are optimized for speed and memory
efficiency.
gbm: A package for building gradient boosting models, which are a type of ensemble learning
method.
rpart: A package for building decision trees, which are a type of classification and regression tree
method.
These are just a few of the many packages available in R for supervised learning. The choice of
package will depend on the specific problem being solved and the characteristics of the data.
Example-1
In R, there are many built-in datasets that can be used to demonstrate the use of KNN for
supervised learning. One such dataset is the "iris" dataset, which contains measurements of the
sepal length, sepal width, petal length, and petal width of three different species of iris flowers.
Here's an example of how to use the "iris" dataset for KNN supervised learning in R:
# Load the "iris" dataset
data(iris)
# Split the data into training and testing sets
library(caret)
set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
trainData <- iris[trainIndex, ]
testData <- iris[-trainIndex, ]
# Preprocess the data by normalizing the features
preprocess <- preProcess(trainData[,1:4], method = c("center", "scale"))
trainData[,1:4] <- predict(preprocess, trainData[,1:4])
testData[,1:4] <- predict(preprocess, testData[,1:4])
# Train the KNN model with K = 3 (class::knn() uses the Euclidean distance)
library(class)
predicted <- knn(train = trainData[, 1:4], test = testData[, 1:4], cl = trainData$Species, k = 3)
# Evaluate the model's accuracy
library(caret)
confusionMatrix(predicted, testData$Species)
In this example, we first load the "iris" dataset and split it into training and testing sets using the
createDataPartition() function from the caret package. We then preprocess the data by normalizing
the features using the preProcess() function from the same package.
Next, we train the KNN model using the knn() function from the class package with K = 3 (knn() uses the Euclidean distance). Finally, we evaluate the performance of the model using the confusionMatrix() function from the caret package, which reports the overall accuracy of the predictions together with per-class statistics such as sensitivity, specificity, and predictive values.
Output
The output of the above example is the confusion matrix, which shows the performance of the KNN model on the testing set. Because the iris data has three classes, the confusion matrix is a 3 × 3 table that cross-tabulates predicted species against true species. For any single class, four quantities can be read from it:
True Positives (TP): samples of that class that are correctly predicted as that class.
False Positives (FP): samples of other classes that are wrongly predicted as that class.
True Negatives (TN): samples of other classes that are correctly predicted as not belonging to that class.
False Negatives (FN): samples of that class that are wrongly predicted as belonging to another class.
The confusion matrix shows that the KNN model achieved an accuracy of 0.9333 on the testing set,
which means that 93.33% of the test samples were correctly classified by the model.
The matrix also shows that the model made 1 false prediction for the "versicolor" class and 1 false
prediction for the "virginica" class. In other words, the model correctly classified all 10 "setosa"
species but made 2 incorrect predictions for the other two species.
Additionally, the matrix provides other metrics such as sensitivity, specificity, positive and
negative predictive values, and prevalence for each class. These metrics provide a more detailed
evaluation of the performance of the model on the different classes.
In Decision Tree, the tree structure is built based on information gain or entropy reduction, which
measures the reduction in uncertainty about the target variable that results from splitting the data
using a particular attribute. The attribute with the highest information gain is chosen as the
splitting criterion at each node.
The algorithm continues to split the data into subsets until a stopping criterion is met, such as
reaching a maximum tree depth, a minimum number of samples in a leaf node, or when all the
samples in a node belong to the same class.
Once the tree is built, it can be used to make predictions by traversing the tree from the root node
down to a leaf node that corresponds to the predicted class. The class of a new sample is
determined by following the path from the root node to the leaf node that the sample falls into.
To implement Decision Tree in R, we can use the "rpart" package, which provides the "rpart()"
function to build a Decision Tree model. The package also provides functions for visualizing the
tree structure and making predictions on new data.
Example-1
Let's demonstrate how to use the "rpart" package to build a Decision Tree model using an inbuilt
dataset in R. We will use the "iris" dataset, which contains measurements of the sepal length, sepal
width, petal length, and petal width of three different species of iris flowers.
Here is an example code that builds a Decision Tree model on the "iris" dataset:
library(rpart)
data(iris)
# Split the dataset into training and testing sets
set.seed(123)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
trainData <- iris[ind == 1, ]
testData <- iris[ind == 2, ]
# Build the Decision Tree model
model <- rpart(Species ~ ., data = trainData, method = "class")
# Visualize the Decision Tree
plot(model)
text(model)
# Make predictions on the testing set
predicted <- predict(model, testData, type = "class")
# Evaluate the model performance (confusionMatrix() comes from the caret package)
library(caret)
confusionMatrix(predicted, testData$Species)
In the code above, we first load the "rpart" package and the "iris" dataset. We then split the dataset
into a training set and a testing set, with a 70-30 split.
Next, we build the Decision Tree model using the "rpart()" function, where we specify the target
variable "Species" and the other variables as the predictors using the formula notation "Species ~ .".
We also specify the method as "class" to indicate that this is a classification problem.
After building the model, we can visualize the Decision Tree using the "plot()" and "text()"
functions.
We then use the "predict()" function to make predictions on the testing set and specify the type of
prediction as "class" to obtain the predicted class labels.
Finally, we evaluate the performance of the model using the "confusionMatrix()" function from the
"caret" package, which computes the confusion matrix and other metrics such as accuracy,
sensitivity, and specificity.
The output of the "confusionMatrix()" function provides a detailed evaluation of the performance of
the Decision Tree model on the testing set. For example, it shows the accuracy of the model, the
number of true positives, true negatives, false positives, and false negatives for each class, as well as
other performance metrics such as sensitivity, specificity, positive predictive value, and negative
predictive value.
Output
The output of the "confusionMatrix()" function shows that the model has an accuracy of 95.56%,
which means that it correctly predicted the species of 43 out of 45 instances in the testing set.
The confusion matrix shows that there is only one misclassification, where one instance of the
"setosa" species was misclassified as "versicolor". The model correctly predicted all instances of the
"virginica" species.
Overall, the result shows that the Decision Tree model performed well on the "iris" dataset,
achieving high accuracy and making only one misclassification. However, it is important to note
that the dataset is relatively small and simple, so the model's performance may not generalize well
to other datasets.
Overall, unsupervised learning requires a careful analysis of the data, selection of appropriate
features and models, and evaluation of the results. It is an iterative process where the model can be
refined based on the results and feedback.
Overall, K-means is a popular clustering algorithm used in unsupervised learning. The process
involves choosing the number of clusters, initializing centroids, assigning data points to clusters,
recalculating centroids, and repeating until convergence. The results can be evaluated by
calculating the WCSS or by visualizing the clusters.
Example-1
# Load the iris dataset
data(iris)
# Select the relevant features
iris_data <- iris[,1:4]
# Scale the data
scaled_data <- scale(iris_data)
# Perform K-means clustering with 3 clusters
kmeans_result <- kmeans(scaled_data, centers = 3)
# Plot the clusters
library(cluster)
clusplot(scaled_data, kmeans_result$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)
# Calculate the within-cluster sum of squares
wss <- sum(kmeans_result$withinss)
# Print the WCSS
cat("WCSS:", wss)
In this example, the iris dataset is loaded and the relevant features are selected. The data is then
scaled using the scale function. K-means clustering is performed with 3 clusters using the kmeans
function. The clusters are then visualized using the clusplot function from the cluster package. The
within-cluster sum of squares (WCSS) is calculated using the kmeans_result$withinss variable and
printed using the cat function.
The output of this code will be a plot of the clusters and the WCSS value. The plot will show the
data points colored by their assigned cluster and the WCSS value will indicate the sum of the
squared distances between each data point and its assigned cluster center.
Output
The output of the example code using K-means clustering on the iris dataset in R includes a plot of
the clusters and the within-cluster sum of squares (WCSS) value.
The plot shows the data points colored by their assigned cluster and separated into three clusters.
The clusplot function from the cluster package is used to create the plot. The horizontal axis shows
the first principal component of the data, while the vertical axis shows the second principal
component. The data points are shaded to show the density of the points in each region. The plot
shows that the data points are well-separated into three distinct clusters, each containing data
points that are similar to each other.
The WCSS value is a measure of the goodness of fit of the K-means clustering model. It measures
the sum of the squared distances between each data point and its assigned cluster center. The lower
the WCSS value, the better the model fits the data. In this example, the WCSS value is printed to the
console using the cat function. The WCSS value for this model is 139.82.
Overall, the output of this example demonstrates how to use K-means clustering on the iris dataset
in R. The plot shows the clusters in the data, while the WCSS value indicates the goodness of fit of
the model.
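Because the number of clusters has to be chosen before running kmeans, a common complement to this example is the elbow method: run K-means for a range of k values and plot the total WCSS against k, looking for the point where the curve bends. A minimal sketch, reusing the scaled_data object created above:
# Elbow method: compute total WCSS for k = 1 to 10 and look for the "bend"
set.seed(123)
wss_values <- sapply(1:10, function(k) {
  kmeans(scaled_data, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss_values, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")
The point where the curve flattens out suggests a reasonable number of clusters; for the iris data this bend typically appears around two or three clusters.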
Hierarchical Clustering
The goal of hierarchical clustering is to build a hierarchy of clusters, starting with each data point in its own cluster and
then merging the most similar clusters together until all of the data points are in a single cluster.
Example-1
In R, we can use the hclust function to perform hierarchical clustering on a dataset. Here is an
example of using hierarchical clustering on the iris dataset in R:
# Load the iris dataset
data(iris)
# Select the relevant features
iris_data <- iris[,1:4]
# Calculate the distance matrix
dist_matrix <- dist(iris_data)
# Perform hierarchical clustering with complete linkage
hc_result <- hclust(dist_matrix, method = "complete")
# Plot the dendrogram
plot(hc_result)
In this example, we first load the iris dataset and select the relevant features. We then calculate the
distance matrix using the dist function. The hclust function is used to perform hierarchical
clustering with complete linkage. The resulting dendrogram is plotted using the plot function.
The output of this code will be a dendrogram that shows the hierarchy of clusters. The dendrogram
shows the data points at the bottom, with lines connecting the clusters that are merged together.
The height of each line indicates the distance between the merged clusters. The dendrogram can be
used to identify the number of clusters in the data based on the distance between the clusters.
Output
The output of the example code using hierarchical clustering on the iris dataset in R is a
dendrogram that shows the hierarchy of clusters.
The dendrogram is a plot that displays the hierarchy of clusters, with the data points at the bottom
and the merged clusters shown as lines that connect the points. The height of each line indicates the
distance between the merged clusters. The dendrogram can be used to determine the optimal
number of clusters in the data based on the distance between the clusters.
In this example, we used the hclust function to perform hierarchical clustering with complete
linkage on the iris dataset. The resulting dendrogram shows that there are three main clusters of
data points in the dataset, which is consistent with the known number of classes in the iris dataset.
By analyzing the dendrogram, we can see that the first split in the data occurs between the setosa
species and the other two species. The second split separates the remaining two species, versicolor
and virginica. The dendrogram can also be used to identify the distance between clusters, which
can be useful for determining the optimal number of clusters to use for further analysis.
Overall, the output of this example demonstrates how to use hierarchical clustering to analyze a
dataset in R and visualize the results using a dendrogram.
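A practical extension of this example is to cut the dendrogram at a chosen number of clusters with cutree and compare the assignments with the known species labels. A short sketch, reusing the hc_result object from the example above:
# Cut the dendrogram into 3 clusters
clusters <- cutree(hc_result, k = 3)
# Cross-tabulate the cluster assignments against the actual species
table(clusters, iris$Species)
The cross-tabulation shows how closely the hierarchical clusters correspond to the three iris species.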
WCSS is a measure of the sum of the squared distances between each data point and its assigned
cluster center. A lower WCSS value indicates a better clustering performance. Silhouette score is a
measure of how well each data point fits into its assigned cluster compared to other clusters. A
higher silhouette score indicates a better clustering performance.
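As a rough illustration of how these metrics can be computed in R, the sketch below calculates the average silhouette width with the cluster package, assuming the scaled_data and kmeans_result objects from the earlier K-means example are still available:
library(cluster)
# Silhouette width for each point, based on the K-means assignments
sil <- silhouette(kmeans_result$cluster, dist(scaled_data))
# Average silhouette width: values closer to 1 indicate better-separated clusters
mean(sil[, "sil_width"])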
While classification and prediction accuracy are not used to evaluate unsupervised learning
algorithms directly, they can be used in certain scenarios to evaluate the performance of an
unsupervised learning model indirectly. For example, if we use the clusters formed by an
unsupervised learning algorithm as input features for a subsequent classification or prediction task,
we can use classification and prediction accuracy as metrics to evaluate the performance of the
overall model.
In summary, while classification and prediction accuracy are not typically used to evaluate the
performance of unsupervised learning algorithms, they can be used in certain scenarios to evaluate
the performance of the overall model that uses the clusters formed by the unsupervised learning
algorithm as input features.
Summary
Machine learning is a field of artificial intelligence that involves developing algorithms and models
that enable computers to learn from data without being explicitly programmed. Machine learning is
used in a wide range of applications, from image and speech recognition to fraud detection and
recommendation systems. There are three main types of machine learning: supervised learning,
unsupervised learning, and reinforcement learning. In supervised learning, the machine is trained
using labeled data, while in unsupervised learning, the machine is trained using unlabeled data.
Reinforcement learning involves training a machine to learn through trial and error. Machine
learning algorithms are typically designed to improve over time as they are exposed to more data,
and they are used in a variety of industries and fields to automate decision-making and solve
complex problems. Studying machine learning provides students with a diverse set of skills and
knowledge, including programming, data handling, analytical and problem-solving skills,
collaboration, and communication skills.
Supervised learning algorithms can be further categorized as either classification or regression,
depending on the nature of the target variable. In classification problems, the target variable is
categorical, and the goal is to predict the class label or category of a new instance. In regression
problems, the target variable is continuous, and the goal is to predict a numerical value or a range
of values for a new instance.
Some common examples of supervised learning algorithms include linear regression, logistic
regression, decision trees, random forests, support vector machines (SVMs), k-nearest neighbors
(KNN), and neural networks. Each algorithm has its own strengths and weaknesses, and the choice
of algorithm depends on the nature of the problem and the characteristics of the data.
Supervised learning has many practical applications in fields such as healthcare, finance,
marketing, and engineering, among others. For example, supervised learning can be used to predict
which patients are at risk of developing a certain disease, to identify potential fraudulent
transactions in financial transactions, or to recommend products to customers based on their
browsing history.
Unsupervised learning algorithms can be used for a variety of tasks, including clustering similar
data points, reducing the dimensionality of data, and discovering hidden structures in the data.
Some common techniques used in unsupervised learning include k-means clustering, hierarchical
clustering, and principal component analysis (PCA).
The performance of unsupervised learning algorithms is typically evaluated using metrics such as
within-cluster sum of squares (WCSS) and silhouette score, which are used to evaluate the quality
of the clusters formed by the algorithm.
While unsupervised learning algorithms are not typically used for making predictions or
classifying new data points, the insights gained from analyzing the data can be used to inform
subsequent supervised learning models or other data analysis tasks. Overall, unsupervised learning
is a valuable tool for exploring and understanding complex data without prior knowledge or
guidance.
Keywords
Artificial Intelligence (AI): A field of computer science that focuses on creating intelligent
machines that can perform tasks that typically require human-like intelligence.
Big data: A large and complex data set that requires advanced tools and techniques to process and
analyze.
Data mining: The process of discovering patterns, trends, and insights in large data sets using
machine learning algorithms.
Deep learning: A subset of machine learning that uses artificial neural networks to model and
solve complex problems.
Neural network: A machine learning algorithm that is inspired by the structure and function of the
human brain.
Supervised learning: A type of machine learning where the machine is trained using labeled data,
with a clear input and output relationship.
Unsupervised learning: A type of machine learning where the machine is trained using unlabeled
data, with no clear input and output relationship.
Reinforcement learning: A type of machine learning where the machine learns by trial and error,
receiving feedback on its actions and adjusting its behavior accordingly.
Model: A mathematical representation of a real-world system or process, which is used to make
predictions or decisions based on data. In machine learning, models are typically trained on data to
improve their accuracy and performance.
Dimensionality reduction: The process of reducing the number of features used in a machine
learning model while still retaining important information. This is often done to improve
performance and reduce overfitting.
Overfitting: A problem that occurs when a machine learning model is too complex and learns to fit
the training data too closely. This can lead to poor generalization to new data.
Underfitting: A problem that occurs when a machine learning model is too simple and fails to
capture important patterns in the data. This can lead to poor performance on the training data and
new data.
Bias: A systematic error that occurs when a machine learning model consistently makes predictions
that are too high or too low.
Variance: The amount by which a machine learning model's output varies with different training
data sets. High variance can lead to overfitting.
Regularization: Techniques used to prevent overfitting in machine learning models, such as adding
a penalty term to the cost function.
Self Assessment
1. Which of the following is a type of machine learning where the machine is trained using
labeled data?
A. Supervised learning
B. Unsupervised learning
C. Reinforcement learning
D. None of the above
2. What is the process of reducing the number of features used in a machine learning model
while still retaining important information?
A. Overfitting
B. Underfitting
C. Dimensionality reduction
D. Bias
4. What is the name of a machine learning algorithm that is inspired by the structure and
function of the human brain?
A. Neural network
B. Gradient descent
C. Decision tree
D. Support vector machine
5. Which type of machine learning involves training a machine to learn through trial and
error?
A. Supervised learning
B. Unsupervised learning
C. Reinforcement learning
D. None of the above
6. Which of the following is a type of machine learning where the machine is trained using
unlabeled data?
A. Supervised learning
B. Unsupervised learning
C. Reinforcement learning
D. None of the above
7. What is the process of discovering patterns, trends, and insights in large data sets using
machine learning algorithms called?
A. Feature engineering
B. Deep learning
C. Data mining
D. Supervised learning
8. Which of the following techniques is used to combine multiple machine learning models to
improve performance and reduce overfitting?
A. Gradient descent
B. Hyperparameter tuning
C. Ensemble learning
D. Regularization
9. What is the name of an optimization algorithm used to find the optimal parameters for a
machine learning model by iteratively adjusting them in the direction of steepest descent of
the cost function?
A. Gradient descent
B. Hyperparameter tuning
C. Regularization
D. Ensemble learning
10. Which of the following is a technique used to prevent underfitting in machine learning
models?
A. Ensemble learning
B. Gradient descent
C. Regularization
D. Hyperparameter tuning
11. Which of the following is not a supervised learning problem?
A. Image classification
B. Text clustering
C. Stock price prediction
D. Sentiment analysis
13. Which package in R provides functions for classification and regression trees?
A. caret
B. e1071
C. randomForest
D. rpart
18. Which of the following evaluation metrics is not used for classification problems?
A. Mean squared error (MSE)
B. Accuracy
C. Precision
D. Recall
19. In which type of supervised learning problem is the target variable categorical?
A. Regression
B. Clustering
C. Classification
D. Dimensionality reduction
Answers for Self Assessment
6. B  7. C  8. C  9. A  10. D
Review Questions
1) What is machine learning, and how is it different from traditional programming?
2) What are the three main types of machine learning, and what are some examples of
problems each type can solve?
3) What is the process of preparing data for use in a machine learning model, and why is it
important?
4) What are some real-world applications of supervised learning, and how are they
implemented?
5) How can machine learning be used to improve healthcare outcomes, and what are some
potential benefits and risks of using machine learning in this context?
6) How can machine learning be used to improve financial decision-making, and what are
some potential benefits and risks of using machine learning in this context?
7) How can machine learning be used to detect and prevent fraud, and what are some potential
benefits and risks of using machine learning in this context?
8) How can machine learning be used to optimize supply chain management, and what are
some potential benefits and risks of using machine learning in this context?
9) How can machine learning be used to improve customer service and customer experience,
and what are some potential benefits and risks of using machine learning in this context?
10) How can machine learning be used to enhance security and privacy, and what are some
potential benefits and risks of using machine learning in this context?
11) How can machine learning be used to advance scientific research, and what are some
potential benefits and risks of using machine learning in this context?
Further Readings
learning, including tutorials, code examples, and best practices. It also includes a section
on deep learning, which is a type of machine learning that is particularly well-suited for
tasks like image recognition and natural language processing.
The Stanford Machine Learning Group: This research group at Stanford University is at
the forefront of developing new machine learning techniques and applications. Their
website includes a wide range of research papers, code libraries, and other resources for
exploring the latest developments in the field.
The Google AI Blog: Google is one of the leading companies in the field of machine
learning, and their AI blog offers insights into the latest research, tools, and applications.
They cover a wide range of topics, from natural language processing and computer vision
to ethics and fairness in machine learning.
The Microsoft Research Blog: Microsoft is another major player in the field of machine
learning, and their research blog covers a wide range of topics related to AI, including
machine learning, deep learning, and natural language processing. They also offer a
variety of tools and resources for developers who want to build machine learning
applications.
The MIT Technology Review: This publication covers a wide range of topics related to
technology and its impact on society, including machine learning. Their articles are often
well-researched and thought-provoking and can provide insights into the broader
implications of machine learning for society and the economy.
Unit 07: Text Analytics for Business
Dr. Mohd Imran Khan, Lovely Professional University
Introduction
Text analytics for business involves using advanced computational techniques to analyze and
extract insights from large volumes of text data. This data can come from a wide range of sources,
including customer feedback, social media posts, product reviews, news articles, and more.
The goal of text analytics for business is to provide organizations with valuable insights that can be
used to make data-driven decisions and improve business performance. This includes identifying
patterns and trends in customer behavior, predicting future trends, monitoring brand reputation,
detecting fraud, and more.
Some of the key techniques used in text analytics for business include natural language processing
(NLP), which involves using computational methods to analyze and understand human language,
and machine learning algorithms, which can be trained to automatically identify patterns and
relationships in text data.
There are many different tools and platforms available for text analytics, ranging from open-source
software to commercial solutions. These tools typically include features for data cleaning and
preprocessing, feature extraction, data visualization, and more.
Overall, text analytics for business can provide organizations with a powerful tool for
understanding and leveraging the vast amounts of text data available to them. By using these
techniques to extract insights and make data-driven decisions, businesses can gain a competitive
advantage and improve their overall performance.
Text analytics for business is a powerful tool for analyzing large volumes of text data and extracting
valuable insights that can be used to make data-driven decisions. However, it is important to keep
in mind several key considerations when working with text data.
Firstly, domain expertise is crucial when analyzing text data. This means having a deep
understanding of the specific industry or context in which the text data is being analyzed. This is
especially important for industries such as healthcare or finance, where specialized knowledge is
required to properly interpret the data.
Secondly, it is important to consider the ethical implications of text analytics. This includes
ensuring that data privacy regulations are followed, and that the data is used ethically and
responsibly. It is also important to be transparent about the use of text analytics and to obtain
consent from those whose data is being analyzed.
Thirdly, integrating text data with other data sources can provide a more comprehensive
understanding of business operations and customer behavior. This can include structured data
from databases or IoT devices, as well as other sources of unstructured data such as images or
audio.
Fourthly, it is important to be aware of the limitations of text analytics. While text analytics is a
powerful tool, automated methods may struggle with complex or nuanced language, or with
accurately interpreting sarcasm or irony.
There is a range of ways in which text analytics can help businesses, organizations, and even social
movements:
It helps businesses understand customer trends, product performance, and service quality. This
results in quicker decision making, enhanced business intelligence, increased productivity, and cost
savings.
It helps researchers explore a great deal of pre-existing literature in a short time, extracting what is
relevant to their study. This supports quicker scientific breakthroughs.
It assists in understanding general trends and opinions in society, which helps governments and
political bodies in decision making.
Text analytic techniques help search engines and information retrieval systems improve their
performance, thereby providing faster user experiences.
It refines user content recommendation systems by categorizing related content.
Sentiment analysis
Sentiment analysis is used to identify the emotions conveyed by the unstructured text. The input
text includes product reviews, customer interactions, social media posts, forum discussions, or
blogs. There are different types of sentiment analysis. Polarity analysis is used to identify if the text
expresses positive or negative sentiment. The categorization technique is used for a more fine-
grained analysis of emotions - confused, disappointed, or angry.
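As a quick, self-contained illustration of polarity analysis in R, the sketch below scores two invented reviews with the "bing" lexicon from the tidytext package; the reviews data frame and its columns are made up for this example:
library(dplyr)
library(tidyr)
library(tidytext)
# Two invented reviews to score
reviews <- tibble(id = 1:2,
                  text = c("The room was clean and the staff were wonderful",
                           "Terrible service and a dirty, noisy room"))
# Tokenise, match words against the bing lexicon, and compute a polarity score per review
reviews %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(id, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(polarity = positive - negative)
A positive polarity value indicates predominantly positive wording, while a negative value points to negative sentiment.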
Brand reputation monitoring: Sentiment analysis can be used to monitor brand reputation by
analyzing mentions of a company or brand on social media, news articles, or other online sources.
This can help companies to identify negative sentiment or potential issues and respond quickly to
protect their brand reputation.
Market research: Sentiment analysis can be used in market research to understand consumer
sentiment towards a particular product, service, or brand. This can help companies to identify
opportunities for innovation or new product development.
Financial analysis: Sentiment analysis can be used in financial analysis to analyze the sentiment
expressed in news articles, social media posts, and other sources of financial news. This can help
investors to make more informed decisions by identifying potential risks and opportunities.
Political analysis: Sentiment analysis can be used in political analysis to analyze public opinion
and sentiment towards political candidates or issues. This can help political campaigns to identify
key issues and target their messaging more effectively.
Topic modelling
This technique is used to find the major themes or topics in a massive volume of text or a set of
documents. Topic modeling identifies the keywords used in text to identify the subject of the
article.
Customer feedback analysis: Topic modeling can be used to analyze customer feedback, such as
product reviews or survey responses, to identify common themes or topics. This can help
companies to identify areas for improvement and prioritize customer needs.
Trend analysis: Topic modeling can be used to identify trends and patterns in large volumes of text
data, such as social media posts or news articles. This can help companies to stay up-to-date on the
latest trends and identify emerging topics or issues.
Competitive analysis: Topic modeling can be used to analyze competitor websites, social media
pages, and other online sources to identify key topics or themes. This can help companies to stay
competitive by understanding the strengths and weaknesses of their competitors.
Named entity recognition (NER)
This technique identifies and classifies named entities mentioned in text, such as people, organizations, and locations.
Fraud detection: NER can be used to identify names, addresses, and other personal information
associated with fraudulent activities, such as credit card fraud or identity theft. This can help
financial institutions and law enforcement agencies to prevent fraud and protect their customers.
Media monitoring: NER can be used to monitor mentions of specific companies, individuals, or
topics in news articles or social media posts. This can help companies to stay up-to-date on the
latest trends and monitor their brand reputation.
Market research: NER can be used to identify the names and affiliations of experts or key
influencers in a particular industry or field. This can help companies to conduct more targeted
research and identify potential collaborators or partners.
Document categorization: NER can be used to automatically categorize documents based on the
named entities mentioned in the text. This can help companies to quickly identify relevant
documents and extract useful information.
Event extraction
This is a text analytics technique that goes a step beyond named entity extraction. Event
extraction recognizes events mentioned in text content, for example, mergers, acquisitions, political
moves, or important meetings. Event extraction requires an advanced understanding of the
semantics of text content. Advanced algorithms strive to recognize not only events but the venue,
participants, date, and time wherever applicable. Event extraction is a beneficial technique that has
multiple uses across fields.
Stemming and lemmatization: Stemming and lemmatization are techniques used to reduce words
to their base form or root form. This can help to reduce the dimensionality of the data and improve
the accuracy of text analytics models.
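A brief sketch of stemming with the tm and SnowballC packages is shown below on a tiny toy corpus invented for the example; lemmatization would typically use a separate package such as textstem, which is only hinted at in the comments:
library(tm)
library(SnowballC)
# A tiny toy corpus
docs <- Corpus(VectorSource(c("running runs easily", "better products were improved")))
# Porter stemming reduces each word to its stem (e.g. "running" becomes "run")
docs <- tm_map(docs, stemDocument)
# Inspect the stemmed text
sapply(docs, as.character)
# For true lemmatization (dictionary base forms), a package such as textstem
# could be used instead, e.g. textstem::lemmatize_strings("running runs ran")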
Sentiment analysis: Sentiment analysis is a common text analytics technique used to identify the
sentiment or emotion expressed in the text data. R programming offers several packages and
functions for sentiment analysis, including the popular "tidytext" and "sentimentr" packages.
Topic modeling: Topic modeling is another common text analytics technique used to identify the
underlying topics or themes in the text data. R programming offers several packages and functions
for topic modeling, including the "tm" and "topicmodels" packages.
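A compact sketch of topic modeling with the tm and topicmodels packages is given below, using four invented one-line documents so the code stays self-contained:
library(tm)
library(topicmodels)
# Four toy documents covering two rough themes (finance and healthcare)
docs <- c("stock markets and interest rates",
          "hospital patients and medical staff",
          "market prices rise as rates fall",
          "doctors treat patients in the clinic")
# Build a document-term matrix with basic preprocessing
corpus <- Corpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corpus, control = list(removePunctuation = TRUE, stopwords = TRUE))
# Fit an LDA model with 2 topics (seed fixed for reproducibility)
lda_model <- LDA(dtm, k = 2, control = list(seed = 1234))
# Top 3 terms per topic and the most likely topic for each document
terms(lda_model, 3)
topics(lda_model)
In practice the number of topics k is chosen by experimenting with several values and inspecting how interpretable the resulting topics are.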
Named entity recognition: Named entity recognition is a technique used to identify and classify
named entities, such as people, organizations, and locations, within the text data. R programming
offers several packages and functions for named entity recognition, including the "openNLP" and
"NLP" packages.
Overall, R programming provides a wide range of tools and techniques for creating and refining
text data for text analytics. By using R programming to preprocess, analyze, and visualize text data,
businesses can gain valuable insights into customer behavior, market trends, and potential risks or
opportunities.
# Top 20 positive and negative words from the review sentiment data
positive_words <- sentiment_data %>%
filter(value > 0) %>%
count(word, sort = TRUE) %>%
head(20)
negative_words<- sentiment_data %>%
filter(value < 0) %>%
count(word, sort = TRUE) %>%
head(20)
wordcloud(positive_words$word, positive_words$n, scale=c(4,0.5), min.freq = 1, colors =
brewer.pal(8, "Dark2"))
wordcloud(negative_words$word, negative_words$n, scale=c(4,0.5), min.freq = 1, colors =
brewer.pal(8, "Dark2"))
By using R programming for sentiment analysis on customer reviews, you can gain insights into the
overall sentiment of customers towards your hotel chain and identify common themes in the positive and negative feedback.
Output
Word Cloud
Most of the words are indeed related to the hotels: room, staff, breakfast, etc. Some words are more
related to the customer experience with the hotel stay: perfect, loved, expensive, dislike, etc.
Sentiment Score
The graph above shows the distribution of sentiment scores for good and bad reviews. Most good
reviews are scored as very positive by the VADER scorer, while bad reviews tend to have lower
compound sentiment scores.
Example-2
Here's an example of sentiment analysis on a real-world dataset of tweets related to the COVID-19
pandemic using R programming:
Data collection: First, we need to collect a dataset of tweets related to COVID-19. We can use the
Twitter API to collect the data, or we can use a pre-existing dataset such as the one available on
Kaggle.
# Load dplyr for the filter step
library(dplyr)
# Load the dataset
tweets <- read.csv("covid19_tweets.csv")
# Filter for English-language tweets
tweets <- tweets %>% filter(lang == "en")
Data cleaning: Next, we need to clean and preprocess the text data to remove any unwanted
characters, punctuation, and stop words. We can use the "tm" package in R to perform this step:
# Load the tm package
library(tm)
# Create a corpus object
corpus <- Corpus(VectorSource(tweets$text))
# Convert text to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
# Define simple transformers for URLs, usernames, and hashtags
removeURL <- content_transformer(function(x) gsub("http\\S+", "", x))
removeTwitterUser <- content_transformer(function(x) gsub("@\\w+", "", x))
removeHashTags <- content_transformer(function(x) gsub("#\\w+", "", x))
# Remove URLs
corpus <- tm_map(corpus, removeURL)
# Remove usernames
corpus <- tm_map(corpus, removeTwitterUser)
# Remove hashtags
corpus <- tm_map(corpus, removeHashTags)
# Remove stop words
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)
# Remove whitespace
corpus <- tm_map(corpus, stripWhitespace)
# Remove numbers
corpus <- tm_map(corpus, removeNumbers)
# Convert back to plain text
clean_data <- sapply(corpus, as.character)
Sentiment analysis: We can use the "tidytext" package in R to perform sentiment analysis on the
cleaned data. This package provides access to pre-trained sentiment lexicons, which we can use to assign a
positive or negative sentiment score to each word in the text data:
# Load the required packages (the AFINN lexicon is downloaded via the textdata package)
library(tidytext)
library(dplyr)
# Load the sentiment lexicon
sentiments <- get_sentiments("afinn")
# Convert the cleaned data to a tidy format
tidy_data <- tibble(text = clean_data) %>%
unnest_tokens(word, text)
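The example as written stops after tokenising the text. A minimal sketch of the scoring and visualization steps described below is given here; it re-tokenises the cleaned text with a tweet identifier, and it assumes the tweets data frame contains a parseable date column (the column name is an assumption):
library(dplyr)
library(tidytext)
library(ggplot2)
# Re-tokenise with an identifier and date so scores can be aggregated per tweet
tweet_tokens <- tibble(tweet_id = seq_along(clean_data),
                       date = as.Date(tweets$date),   # assumes a "date" column exists
                       text = clean_data) %>%
  unnest_tokens(word, text)
# One AFINN score per tweet
tweet_scores <- tweet_tokens %>%
  inner_join(sentiments, by = "word") %>%
  group_by(tweet_id, date) %>%
  summarise(score = sum(value), .groups = "drop")
# Histogram of sentiment scores
ggplot(tweet_scores, aes(score)) +
  geom_histogram(binwidth = 1, fill = "lightblue", colour = "black") +
  labs(x = "Sentiment score", y = "Number of tweets")
# Average sentiment over time
tweet_scores %>%
  group_by(date) %>%
  summarise(avg_score = mean(score)) %>%
  ggplot(aes(date, avg_score)) +
  geom_line(colour = "lightblue")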
The first visualization in the example is a histogram of sentiment scores, which shows the
distribution of sentiment in the dataset. The x-axis represents the sentiment score, and the y-axis
represents the number of tweets with that score. The histogram is colored in light blue and has
black borders.
The histogram shows that the sentiment scores in the dataset are mostly centered around zero,
indicating a neutral sentiment. However, there are some tweets with a positive sentiment score and
some tweets with a negative sentiment score, suggesting that there is some variation in the
sentiment of the tweets related to COVID-19.
The second visualization in the example is a time series plot of sentiment scores over time. The x-
axis represents the date of the tweet, and the y-axis represents the average sentiment score for
tweets posted on that day. The plot is colored in light blue and has a solid line connecting the
points.
The time series plot shows that the sentiment of the tweets related to COVID-19 has fluctuated over
time. There are some periods where the sentiment is more positive, such as in March 2020 when the
pandemic was first declared, and other periods where the sentiment is more negative, such as in
January 2021 when new variants of the virus were identified. The plot can help identify trends in
the sentiment of the tweets related to COVID-19 over time, which can be useful for understanding
public opinion and sentiment around the pandemic.
Summary
Text analytics, also known as text mining, is the process of analyzing unstructured text data to
extract meaningful insights and patterns. It involves applying statistical and computational
techniques to text data to identify patterns and relationships between words and phrases, and to
uncover insights that can help organizations make data-driven decisions.
Text analytics can be used for a wide range of applications, such as sentiment analysis, topic
modeling, named entity recognition, and event extraction. Sentiment analysis involves identifying
the sentiment of text data, whether it is positive, negative, or neutral. Topic modeling involves
identifying topics or themes within a text dataset, while named entity recognition involves
identifying and classifying named entities, such as people, organizations, and locations. Event
extraction involves identifying and extracting events and their related attributes from text data.
Text analytics can provide valuable insights for businesses, such as identifying customer
preferences and opinions, understanding market trends, and detecting emerging issues and
concerns. It can also help organizations monitor their brand reputation, improve customer service,
and optimize their marketing strategies.
Text analytics can be performed using various programming languages and tools, such as R,
Python, and machine learning libraries. It requires a combination of domain knowledge, statistical
and computational expertise, and creativity in identifying relevant patterns and relationships
within text data.
In summary, text analytics is a powerful tool for analyzing and extracting insights from
unstructured text data. It has a wide range of applications in business and can help organizations
make data-driven decisions, improve customer service, and optimize their marketing strategies.
Keywords
Text Analytics: The process of analyzing unstructured text data to extract meaningful insights and
patterns.
Sentiment Analysis: The process of identifying and extracting the sentiment of text data, whether it
is positive, negative, or neutral.
Topic Modeling: The process of identifying topics or themes within a text dataset.
Named Entity Recognition: The process of identifying and classifying named entities, such as
people, organizations, and locations, in a text dataset.
Event Extraction: The process of identifying and extracting events and their related attributes from
text data.
Natural Language Processing (NLP): The use of computational techniques to analyze and
understand natural language data.
Machine Learning: The use of algorithms and statistical models to learn patterns and insights from
data.
Corpus: A collection of text documents used for analysis.
Term Document Matrix: A matrix representation of the frequency of terms in a corpus.
Word Cloud: A visual representation of the most frequently occurring words in a corpus, with
larger font sizes indicating higher frequency.
Self Assessment
1. What is text analytics?
A. The process of analyzing structured data
B. The process of analyzing unstructured text data
C. The process of analyzing both structured and unstructured data
D. The process of creating structured data from unstructured text data
11. Which function in R is used to preprocess text data by removing stop words and
stemming?
A. tm_map()
B. corpus()
C. termFreq()
D. wordcloud()
22. What is the purpose of the control argument in the LDA function in R?
23. Which function in R can be used to print the top words in each topic after performing
topic modeling?
A. topics()
B. top.words()
C. lda_terms()
D. terms()
Answers for Self Assessment
6. A  7. B  8. B  9. D  10. C
Review Questions
1) What are the common steps involved in topic modeling using R?
2) How can you preprocess text data for topic modeling in R?
3) What is a document-term matrix, and how is it used in topic modeling?
4) What is LDA, and how is it used for topic modeling in R?
5) How do you interpret the output of topic modeling in R, including the document-topic
matrix and top words in each topic?
6) What are some common techniques for evaluating the quality of topic modeling results in
R?
7) Can you describe some potential applications of topic modeling in various fields, such as
marketing, social sciences, or healthcare?
8) How can you visualize the results of topic modeling in R?
9) What are some best practices to follow when performing topic modeling in R, such as
choosing the optimal number of topics and tuning model parameters?
10) What are some common challenges in text analytics and how can they be addressed?
Further Readings
"Text Mining with R: A Tidy Approach" by Julia Silge and David Robinson - This book
provides a comprehensive introduction to text mining with R programming, covering
topics such as sentiment analysis, topic modeling, and natural language processing.
Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O'Reilly Media, Inc.
Sarkar, D. (2019). Text analytics with Python: A practical real-world approach to gaining
actionable insights from your data. Apress.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval.
Cambridge University Press.
Berry, M. W., & Castellanos, M. (2008). Survey of text mining: Clustering, classification,
and retrieval. Springer.
Unit 08: Business Intelligence
Introduction
Decisions drive organizations. Making a good decision at a critical moment may lead to a more
efficient operation, a more profitable enterprise, or perhaps a more satisfied customer. So, it only
makes sense that the companies that make better decisions are more successful in the long run.
That’s where business intelligence comes in. Business intelligence is defined in various ways (our
chosen definition is in the next section). For the moment, though, think of BI as using data about
yesterday and today to make better decisions about tomorrow. Whether it’s selecting the right
criteria to judge success, locating and transforming the appropriate data to draw conclusions, or
arranging information in a manner that best shines a light on the way forward, business
intelligence makes companies smarter. It allows managers to see things more clearly, and permits
them a glimpse of how things will likely be in the future.
Business intelligence is a flexible resource that can work at various organizational levels and
various times — these, for example:
A sales manager is deliberating over which prospects the account executives should focus on
in the final-quarter profitability push
An automotive firm’s research-and-development team is deciding which features to include in
next year’s sedan
The fraud department is deciding on changes to customer loyalty programs that will root out
fraud without sacrificing customer satisfaction
BI (Business Intelligence) is a set of processes, architectures, and technologies that convert raw data
into meaningful information that drives profitable business actions. It is a suite of software and
services to transform data into actionable intelligence and knowledge.
Business Intelligence has evolved considerably over time.
BI has a direct impact on organization’s strategic, tactical and operational business decisions. BI
supports fact-based decision making using historical data rather than assumptions and gut feeling.
BI tools perform data analysis and create reports, summaries, dashboards, maps, graphs, and charts
to provide users with detailed intelligence about the nature of the business.
8.2 BI – Advantages
Business Intelligence has the following advantages:
1. Boost productivity
With a BI program, businesses can create reports with a single click, which saves a great deal of
time and resources. It also allows employees to be more productive on their tasks.
2. To improve visibility
BI also helps to improve the visibility of business processes and makes it possible to identify any
areas that need attention.
3. Fix Accountability
A BI system helps fix accountability in the organization, since someone must own responsibility
for the organization's performance against its set goals.
BI takes out much of the complexity associated with business processes. It also automates analytics by
offering predictive analysis, computer modeling, benchmarking, and other methodologies.
BI – Disadvantages
1. Cost:
Business intelligence can prove costly for small as well as medium-sized enterprises. Using such a
system may be expensive for routine business transactions.
2. Complexity:
Another drawback of BI is the complexity involved in implementing the data warehouse. It can be so
complex that it makes business processes rigid and difficult to adapt.
3. Limited use
Like many new technologies, BI was initially designed with the purchasing power of large firms in
mind. As a result, BI systems are still not affordable for many small and medium-sized
companies.
4. Time-Consuming Implementation
It can take around a year and a half for a data warehousing system to be completely implemented,
which makes BI a time-consuming process.
Data:
This is the most important factor in business intelligence: without data there is nothing to
analyze or report on. Data can come from a variety of sources, both internal and external to the
organization. Internal data sources can include things like transaction data, customer data,
financial data, and operational data. External data sources can include public records, social media
data, market research data, and competitor data.
The data gathering process must be designed to collect the right data from the right sources. Once
the data is gathered, it must then be cleaned and standardized so that it can be properly analyzed.
In the BI environment, data is king. All other factors must be aligned in order to support the data
and help it reach its full potential.
People:
The people involved in business intelligence play a critical role in its success. From the data
analysts who gather and clean the data, to the business users who interpret and use the data to
make decisions, each person involved must have a clear understanding of their role in the process.
The data analysts need to be able to collect data from all relevant sources, clean and standardize the
data, and then load it into the BI system. The business users need to be able to access the data,
understand what it means, and use it to make decisions.
This can take many forms, but a key component is data literacy: the ability to read, work with,
analyze, and argue with data. Data literacy is essential for business users to be able to make
decisions based on data.
In a successful business intelligence environment, people are trained and empowered to use data to
make decisions.
Processes:
Database Types
Relational Database
Object-Oriented Database
Distributed Database
NoSQL Database
Graph Database
Cloud Database
Centralized Database
Operational Database
The processes used to gather, clean, and analyze data must be well-designed and efficient in order
to produce accurate and timely results. The data gathering process must be designed to collect the
right data from the right sources. Once the data is gathered, it must then be cleaned and
standardized so that it can be properly analyzed.
The data analysis process must be designed to answer the right questions, and the results of the
analysis must be presented in a way that is easy to understand and use.
Technology:
The technology used to support business intelligence must be up to date and able to handle the
volume and complexity of data. The BI system must be able to collect data from all relevant
sources, clean and standardize the data, and then load it into the system. The system must be able
to support the data analysis process and provide easy-to-use tools for business users to access and
analyze the data.
You want your BI technology to offer features such as self-service analytics, predictive analytics,
and social media integration. However, the technology must be easy enough to use that business
users don’t need a Ph.D. to use it. In a successful business intelligence environment, the technology
is easy to use and provides the features and functionality needed to support the data gathering,
analysis, and reporting process.