

Unit 04: Business Forecasting using Time Series

Trend analysis means determining consistent movement in a certain direction. There are two types
of trends: deterministic, where we can find the underlying cause, and stochastic, which is random
and unexplainable.
Seasonal variation describes events that occur at specific and regular intervals during the course of
a year. Serial dependence occurs when data points close together in time tend to be related.
Time series analysis and forecasting models must define the types of data relevant to answering the
business question. Once analysts have chosen the relevant data they want to analyze, they choose
what types of analysis and techniques are the best fit.
Important Considerations for Time Series Analysis
While time series data is data collected over time, there are different types of data that describe how
and when that time data was recorded. For example:
Time series data is data that is recorded over consistent intervals of time.
Cross-sectional data consists of several variables recorded at the same time.
Pooled data is a combination of both time series data and cross-sectional data.
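To make the distinction concrete, here is a minimal R sketch (the store names and sales figures are invented purely for illustration):
# Time series data: one store's quarterly sales recorded over two years
store_sales_ts <- ts(c(200, 210, 250, 240, 260, 300, 310, 305),
                     start = c(2021, 1), frequency = 4)
# Cross-sectional data: several stores observed in the same quarter
cross_section <- data.frame(store = c("A", "B", "C"),
                            sales = c(305, 410, 290))
# Pooled (panel) data: several stores observed over several quarters
pooled <- data.frame(store = rep(c("A", "B"), each = 3),
                     quarter = rep(1:3, times = 2),
                     sales = c(200, 210, 250, 180, 195, 230))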
Time Series Analysis Models and Techniques
Just as there are many types of time series data and models, there are also a variety of methods to study the data. Here are three of the most common.
Box-Jenkins ARIMA models: These univariate models are used to better understand a single time-dependent variable, such as temperature over time, and to predict its future values. They work on the assumption that the data are stationary, so analysts must account for and remove trend and seasonality in the historical data, typically by differencing. Conveniently, the ARIMA framework already includes moving-average terms, (seasonal) difference operators, and autoregressive terms within the model.
Box-Jenkins Multivariate Models: Multivariate models are used to analyze more than one time-
dependent variable, such as temperature and humidity, over time.
Holt-Winters Method: The Holt-Winters method is an exponential smoothing technique. It is
designed to predict outcomes, provided that the data points include seasonality.
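As an illustration, the Holt-Winters method is available in base R through the HoltWinters() function; the sketch below uses the built-in AirPassengers series (monthly airline passenger counts with a clear trend and seasonality):
# Fit a Holt-Winters model with level, trend and seasonal components
hw_fit <- HoltWinters(AirPassengers)
hw_fit   # shows the estimated smoothing parameters alpha, beta and gamma
# Forecast the next 12 months and plot observed, fitted and forecasted values
hw_pred <- predict(hw_fit, n.ahead = 12)
plot(hw_fit, hw_pred)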

4.7 Exploration of Time Series Data using R


Exploration of time series data using R can be done in several ways. Here are some of the key steps
and tools you can use:
Loading the data: The first step is to load your time series data into R. You can use the read.csv() or read.table() function to read in data from a file, and the ts() function to create a time series object from a numeric vector or matrix.
Understanding the data: Once you have loaded your data, you should explore it to get a sense of what it looks like. You can use the head(), tail(), and summary() functions to see the first and last few rows of the data, as well as some basic summary statistics. You can also plot the data using plot() or the ggplot2 package to get a visual understanding of the time series.
Decomposition: Time series data often has multiple components, including trend, seasonal, and
cyclical patterns. You can use the decompose() function to separate out these components and
explore them separately.
Smoothing: Time series data can be noisy, making it difficult to see trends and patterns. You can
use smoothing techniques such as moving averages or exponential smoothing to reduce noise and
highlight trends.
Stationarity: Many time series models require the data to be stationary, meaning that the statistical
properties of the data do not change over time. You can use tests such as the Augmented Dickey-
Fuller (ADF) test to check for stationarity.
Time series models: There are many models that can be used to analyze time series data, including ARIMA, SARIMA, and Prophet. You can use functions such as arima(), auto.arima(), and prophet() to fit these models and make predictions (a consolidated sketch of these exploration steps is given at the end of this overview).


Visualization: Finally, you can use various visualization techniques to explore the time series data
further. This can include plotting the data with different levels of smoothing or using techniques
such as time series forecasting to predict future values.
Overall, R provides a wide range of tools and techniques for exploring and analyzing time series
data, making it a powerful tool for time series analysis.
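To make these steps concrete, here is a minimal sketch using the built-in AirPassengers series (the tseries and forecast packages are assumed to be installed):
# Load and inspect a monthly time series (frequency = 12)
ap <- AirPassengers
head(ap)
summary(ap)
plot(ap)
# Decompose into trend, seasonal and random components
plot(decompose(ap))
# Smooth with a 12-month centred moving average to highlight the trend
plot(filter(ap, rep(1 / 12, 12), sides = 2))
# Check stationarity of the differenced log series with the ADF test (tseries)
library(tseries)
adf.test(diff(log(ap)))
# Fit an automatic ARIMA model (forecast) and forecast 12 months ahead
library(forecast)
ap_fit <- auto.arima(ap)
plot(forecast(ap_fit, h = 12))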
Time series analysis in R is used to see how a quantity behaves over a period of time. In R, a time series object can be created easily with the ts() function. It takes a data vector and associates each data point with a timestamp as specified by the user. Such objects are commonly used to study and forecast the behaviour of business quantities over time, for example the sales of a company, inventory levels, the price of a particular stock or market, or population figures.
Syntax:
objectName <- ts(data, start, end, frequency)
where,
data represents the data vector
start represents the time of the first observation
end represents the time of the last observation
frequency represents the number of observations per unit of time; for example, frequency = 12 for monthly data and frequency = 1 for annual data.
Note: To know about more optional parameters, use the following command in R console:
help("ts")
Example: Let us take the COVID-19 pandemic as an example, using the total number of worldwide positive COVID-19 cases recorded weekly from 22 January 2020 to 15 April 2020 as the data vector.
# Weekly data of COVID-19 positive cases from
# 22 January, 2020 to 15 April, 2020
x <- c(580, 7813, 28266, 59287, 75700,
       87820, 95314, 126214, 218843, 471497,
       936851, 1508725, 2072113)
# library required for decimal_date() function
library(lubridate)
# output to be created as png file
png(file = "timeSeries.png")
# creating time series object
# from date 22 January, 2020
mts <- ts(x, start = decimal_date(ymd("2020-01-22")), frequency = 365.25 / 7)
# plotting the graph
plot(mts, xlab = "Weekly Data",
     ylab = "Total Positive Cases",
     main = "COVID-19 Pandemic",
     col.main = "darkgreen")
# saving the file
dev.off()


Multivariate Time Series


A multivariate time series contains several series observed over the same time points; in R they can be combined into a single ts object and plotted in one chart.
Example: Taking the weekly data of total positive cases and total deaths from COVID-19 from 22 January 2020 to 15 April 2020 as data vectors.
# Weekly data of COVID-19 positive cases and
# weekly deaths from 22 January, 2020 to
# 15 April, 2020
positiveCases <- c(580, 7813, 28266, 59287,
                   75700, 87820, 95314, 126214,
                   218843, 471497, 936851,
                   1508725, 2072113)
deaths <- c(17, 270, 565, 1261, 2126, 2800,
            3285, 4628, 8951, 21283, 47210,
            88480, 138475)
# library required for decimal_date() function
library(lubridate)
# output to be created as png file
png(file = "multivariateTimeSeries.png")
# creating multivariate time series object
# from date 22 January, 2020
mts <- ts(cbind(positiveCases, deaths),
          start = decimal_date(ymd("2020-01-22")), frequency = 365.25 / 7)
# plotting the graph


plot(mts, xlab = "Weekly Data",
     main = "COVID-19 Cases",
     col.main = "darkgreen")
# saving the file
dev.off()

4.8 Forecasting Using ARIMA Methodology


Forecasting can be done on a time series using several models available in R. In this example, an automated ARIMA model is used. To learn about further parameters of the arima() function, use the command below.
help("arima")
In the code below, forecasting is done using the forecast library, so the forecast package must be installed.
# Weekly data of COVID-19 cases from
# 22 January, 2020 to 15 April, 2020
x <- c(580, 7813, 28266, 59287, 75700,
       87820, 95314, 126214, 218843,
       471497, 936851, 1508725, 2072113)
# library required for decimal_date() function
library(lubridate)
# library required for forecasting
library(forecast)
# output to be created as png file
png(file = "forecastTimeSeries.png")
# creating time series object
# from date 22 January, 2020
mts <- ts(x, start = decimal_date(ymd("2020-01-22")),
          frequency = 365.25 / 7)
# forecasting model using arima model


fit <- auto.arima(mts)


# Next 5 forecasted values
forecast(fit, 5)
# plotting the graph with next
# 5 weekly forecasted values
plot(forecast(fit, 5), xlab = "Weekly Data",
     ylab = "Total Positive Cases",
     main = "COVID-19 Pandemic", col.main = "darkgreen")
# saving the file
dev.off()
Output:
After executing the above code, the following forecasted results are produced.
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
2020.307 2547989 2491957 2604020 2462296 2633682
2020.326 2915130 2721277 3108983 2618657 3211603
2020.345 3202354 2783402 3621307 2561622 3843087
2020.364 3462692 2748533 4176851 2370480 4554904
2020.383 3745054 2692884 4797225 2135898 5354210
The graph below plots the estimated forecasted values of COVID-19 cases if the pandemic continues to spread over the next 5 weeks.

4.9 Forecasting Using GARCH Methodology


GARCH (Generalized Autoregressive Conditional Heteroskedasticity) models are a class of time
series models that are commonly used for financial data analysis. These models are designed to


capture the volatility clustering and persistence in financial data by allowing the variance of the
series to be time-varying.
GARCH models build upon the autoregressive conditional heteroskedasticity (ARCH) framework,
which models the conditional variance of a series as a function of its past squared residuals. The
GARCH model extends this by including both past squared residuals and past conditional
variances in the conditional variance equation. This allows for the persistence of volatility to be
captured in the model.
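For example, in the widely used GARCH(1,1) specification (a standard formulation, stated here for reference), the conditional variance σ²t of the residual εt is
σ²t = ω + α1·ε²t−1 + β1·σ²t−1
where ω, α1 and β1 are parameters to be estimated: α1 measures the reaction of volatility to the previous squared residual and β1 the persistence of the previous conditional variance, with α1 + β1 close to 1 indicating highly persistent volatility.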
There are several variations of GARCH models, including the original GARCH, the EGARCH
(Exponential GARCH), the TGARCH (Threshold GARCH), and the IGARCH (Integrated GARCH).
These models differ in how they model the conditional variance and incorporate additional features
such as asymmetry, leverage effects, and long-term memory.
GARCH models are estimated using maximum likelihood estimation, and model selection can be
done using criteria such as Akaike Information Criterion (AIC) or Bayesian Information Criterion
(BIC). Once a GARCH model has been estimated, it can be used for forecasting future volatility and
assessing risk.
GARCH models have become popular in finance because they are able to capture the complex
dynamics of financial data and provide accurate estimates of future volatility. However, they can be
computationally intensive to estimate and may require large amounts of data to achieve accurate
results. Additionally, the basic GARCH model assumes normally distributed residuals, which may not be appropriate for all types of financial data (extensions using heavier-tailed distributions, such as Student's t, are often used instead).
Example
# Load the necessary packages
library(rugarch)
# Load the data: msft is assumed to be a data frame of daily Microsoft
# closing prices with a column named Close (e.g. read from a CSV file or
# downloaded with a package such as tseries or quantmod)
msft <- read.csv("msft.csv")   # hypothetical file name
# Specify a GARCH(1,1) model
spec <- ugarchspec(variance.model = list(model = "sGARCH", garchOrder = c(1, 1)),
                   mean.model = list(armaOrder = c(0, 0)))
# Estimate the model
fit <- ugarchfit(spec, data = msft$Close)
# Print the fitted model
show(fit)
# Make a forecast for the next 5 periods
forecast <- ugarchforecast(fit, n.ahead = 5)
# Plot the forecasted volatility (in an interactive session plot() prompts for
# a plot selection; a specific panel can also be chosen with the 'which' argument)
plot(forecast)
In this example, msft is assumed to be a data frame of daily Microsoft closing prices with a column named Close (the tseries package does not ship such a dataset, so the prices must be loaded separately, for example from a CSV file or a financial data provider). We then specify a GARCH(1,1) model using the ugarchspec function from the rugarch package. The variance.model argument specifies a standard GARCH model (model = "sGARCH") with one ARCH term and one GARCH term (garchOrder = c(1,1)). The mean.model argument specifies a constant mean with no ARMA terms (armaOrder = c(0,0)).
We then fit the model to the data using the ugarchfit function and print the fitted model object. The output displays the estimated parameters of the model, as well as diagnostic statistics such as the log-likelihood and information criteria (including the AIC).
Next, we use the ugarchforecast function to make a forecast for the next 5 periods. Finally, we plot
the forecasted volatility using the plot function.
*---------------------------------*
* GARCH Model Fit *


*---------------------------------*
Conditional Variance Dynamics
-----------------------------------
GARCH Model : sGARCH(1,1)
Mean Model : ARFIMA(0,0,0)
Distribution : norm
Optimal Parameters
------------------------------------
Estimate Std. Error t value Pr(>|t|)
mu 27.60041 0.042660 647.008 0.00000
omega 0.06801 0.003504 19.419 0.00000
alpha1 0.10222 0.010072 10.153 0.00000
beta1 0.88577 0.006254 141.543 0.00000
Robust Standard Errors:
Estimate Std. Error t value Pr(>|t|)
mu 27.60041 0.031563 874.836 0.00000
omega 0.06801 0.004074 16.691 0.00000
alpha1 0.10222 0.010039 10.183 0.00000
beta1 0.88577 0.008246 107.388 0.00000
LogLikelihood : 2321.214
Information Criteria
------------------------------------

Akaike -13.979
Bayes -13.952
Shibata -13.979
Hannan-Quinn -13.969
Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
statistic p-value
Lag[1] 0.525 0.468736
Lag[2*(p+q)+(p+q)-1][4] 1.705 0.074600
Lag[4*(p+q)+(p+q)-1][8] 2.396 0.001582
d.o.f=0
H0 : No serial correlation
Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
statistic p-value
Lag[1] 0.007 0.933987
Lag[2*(p+q)+(p+q)-1][4] 2.710 0.244550


Lag[4*(p+q)+(p+q)-1][8] 4.084 0.193941


d.o.f=2
Weighted ARCH LM Tests
------------------------------------
Statistic Shape Scale P-Value
ARCH Lag[3] 0.1691 0.500 2.000 0.6806
ARCH Lag[5] 0.2474 1.440 1.667 0.9982
ARCH Lag[7] 0.8799 2.315 1.543 1.0000
Nyblom stability test
------------------------------------
Joint Statistic: 1.4489
Individual Statistics:
mu 0.3925
omega 0.0495
alpha1 0.6311
beta1 0.3752
Asymptotic Critical Values (10% 5% 1%)
Joint Statistic: 1.58 1.88 2.54

4.10 Forecasting Using VAR Methodology


VAR (Vector Autoregression) is a statistical methodology for time series analysis that allows for
modeling the interdependencies between multiple time series variables. Here are the key steps
involved in using VAR for time series analysis:
Data preparation: Make sure your data is in a time series format with a consistent frequency. You
should also make sure your data is stationary by checking for trends, seasonality, and non-
stationary behavior.
Model specification: Specify the number of lags (p) that you want to include in the model and
whether you want to include a constant term. You can also specify different types of VAR models,
such as VARX (VAR with exogenous variables) or VEC (Vector Error Correction).
Estimation: Estimate the parameters of the VAR model using techniques such as OLS (Ordinary
Least Squares) or maximum likelihood estimation.
Diagnostics: Conduct diagnostic tests to evaluate the goodness of fit of the model, such as residual
diagnostics, serial correlation tests, and ARCH tests.
Forecasting: Use the estimated VAR model to generate forecasts for future time periods.
There are different software packages and programming languages that can be used to implement VAR methodologies for time series analysis, such as R, Python, MATLAB, and EViews. Each of these tools has its own syntax and functions for implementing VAR models, so it's important to consult the relevant documentation for the software you are using.
Example
Load the necessary libraries:
library(vars)
library(tidyverse)
library(ggfortify)   # provides the autoplot() methods used below
Prepare your data:
For this example, we will use the built-in R dataset "economics", which contains monthly U.S.
economic time series data from 1967 to 2015. We will use the variables "unemploy" (unemployment
rate) and "pop" (total population).


data("economics")
data_ts <- ts(economics[, c("unemploy", "pop")], start = c(1967, 1), frequency = 12)
Before estimating the VAR model, let's plot the data to visualize any trends or seasonality.
autoplot(data_ts) +
  labs(title = "Unemployment Rate and Total Population in the US",
       x = "Year",
       y = "Value")
From the plot, we can see that both variables have a clear upward trend, so we will need to take
first differences to make them stationary.
data_ts_diff <- diff(data_ts)
Now that our data is stationary, we can proceed to estimate the VAR model.
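Although the walk-through below fixes the lag order at 2, a common preliminary step (shown here as an optional sketch, not part of the original example) is to let the vars package suggest a lag length by information criteria:
# Suggest a VAR lag order using information criteria (AIC, HQ, SC and FPE)
VARselect(data_ts_diff, lag.max = 8, type = "const")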
Estimate the VAR model:
We will estimate a VAR model with 2 lags and a constant term using the VAR() function from the
vars library.
var_model <- VAR(data_ts_diff, p = 2, type = "const")
We can use the summary() function to view a summary of the estimated VAR model.
summary(var_model)
The output will show the estimated coefficients, standard errors, t-values, and p-values for each lag
of each variable in the VAR model.
Evaluate the VAR model:
To evaluate the VAR model, we can perform a serial correlation test and an ARCH test using the
serial.test() and arch.test() functions from the vars library, respectively.
serial.test(var_model)
arch.test(var_model)
If the p-value of the serial correlation test is less than 0.05, it indicates the presence of serial
correlation in the residuals, which violates the assumptions of the VAR model. Similarly, if the p-
value of the ARCH test is less than 0.05, it indicates the presence of conditional heteroskedasticity
in the residuals.
Forecast using the VAR model:
To forecast future values of the variables, we can use the predict() function from the vars library.
var_forecast <- predict(var_model, n.ahead = 12)
This will generate a forecast for the next 12 months, based on the estimated VAR model. We can visualize the forecast using the autoplot() function from the ggfortify library.
autoplot(var_forecast) +
  labs(title = "Forecast of Unemployment Rate and Total Population in the US",
       x = "Year",
       y = "Value")
This plot shows the forecasted values for the next 12 months for both variables in the VAR model.
We can see that the unemployment rate is expected to decrease slightly over the next year, while
the total population is expected to continue increasing.


Summary
Business forecasting using time series involves using statistical methods to analyze historical data
and make predictions about future trends in business variables such as sales, revenue, and demand
for products or services. Time series analysis involves analyzing the pattern of the data over time,
including identifying trends, seasonal patterns, and cyclical fluctuations.
One popular approach to time series forecasting is the use of ARIMA (autoregressive integrated
moving average) models, which can capture trends and seasonal patterns in the data, as well as the
autocorrelation of the series. Another popular approach is the use of VAR (vector autoregression)
models, which can capture the interdependencies between multiple time series variables.
Business forecasting using time series can be used for a variety of purposes, such as predicting sales
for a specific product or service, forecasting future demand for inventory, and predicting overall
market trends. Accurate forecasting can help businesses make informed decisions about resource
allocation, inventory management, and overall business strategy.
To be effective, time series forecasting requires high-quality data, including historical data and
relevant external factors such as changes in the economy, weather patterns, or industry trends. In
addition, it's important to validate and test the accuracy of the forecasting models using historical
data before applying them to future predictions.
Overall, business forecasting using time series analysis can be a valuable tool for businesses looking
to make data-driven decisions and stay ahead of market trends.

Keywords
Time series: A collection of observations measured over time, typically at regular intervals.
Trend: A gradual, long-term change in the level of a time series.
Seasonality: A pattern of regular fluctuations in a time series that repeat at fixed intervals.
Stationarity: A property of a time series where the mean, variance, and autocorrelation structure
are constant over time.
Autocorrelation: The correlation between a time series and its own past values.
White noise: A type of time series where the observations are uncorrelated and have constant
variance.
ARIMA: A statistical model for time series data that includes autoregressive, differencing, and
moving average components.
Exponential smoothing: A family of time series forecasting models that use weighted averages of
past observations, with weights that decay exponentially over time.
Seasonal decomposition: A method of breaking down a time series into trend, seasonal, and
residual components.
Forecasting: The process of predicting future values of a time series based on past observations and
statistical models.

Self Assessment
1. What is a time series?
A. A collection of observations measured over time
B. A statistical model for time series data
C. A method of breaking down a time series into trend, seasonal, and residual components
D. A type of time series where the observations are uncorrelated and have constant variance

2. What is trend in a time series?


A. A pattern of regular fluctuations in a time series that repeat at fixed intervals


B. A gradual, long-term change in the level of a time series


C. The correlation between a time series and its own past values
D. A property of a time series where the mean, variance, and autocorrelation structure are
constant over time

3. What is seasonality in a time series?


A. A type of time series where the observations are uncorrelated and have constant variance
B. A pattern of regular fluctuations in a time series that repeat at fixed intervals
C. A method of breaking down a time series into trend, seasonal, and residual components
D. A gradual, long-term change in the level of a time series

4. What is autocorrelation in a time series?


A. A method of breaking down a time series into trend, seasonal, and residual components
B. A type of time series where the observations are uncorrelated and have constant variance
C. The correlation between a time series and its own past values
D. A gradual, long-term change in the level of a time series

5. What is ARIMA?
A. A statistical model for time series data that includes autoregressive, differencing, and
moving average components
B. A method of breaking down a time series into trend, seasonal, and residual components
C. A family of time series forecasting models that use weighted averages of past observations
D. A type of time series where the observations are uncorrelated and have constant variance

6. What R package is commonly used for time series analysis?


A. ggplot2
B. dplyr
C. lubridate
D. forecast

7. How do you convert a data frame in R to a time series object?


A. Use the as.ts() function
B. Use the ts() function
C. Use the zoo() function
D. Use the xts() function

8. What is the difference between the ts() and zoo() functions in R?


A. ts() is used for time series data with regular time intervals, while zoo() is used for irregular
time intervals
B. ts() is used for time series data with irregular time intervals, while zoo() is used for regular
time intervals
C. ts() is a base R function, while zoo() is a package for time series analysis
D. There is no difference between ts() and zoo()

9. How do you plot a time series in R using the ggplot2 package?


A. Use the qplot() function


B. Use the ggplot() function with geom_line()


C. Use the autoplot() function
D. Use the plot() function with type = "l"

10. What is the purpose of the forecast() function in R?


A. To fit an ARIMA model to a time series
B. To compute forecasts for a time series using a chosen model
C. To decompose a time series into trend, seasonal, and residual components
D. To convert a data frame to a time series object

11. What R function can be used to calculate the autocorrelation function (ACF) and partial
autocorrelation function (PACF) for a time series?
A. acf()
B. pacf()
C. auto.correlation()
D. corr()

12. What is the purpose of the AIC() function in R?


A. To calculate the Akaike information criterion (AIC) for a given model
B. To calculate the autocorrelation function (ACF) for a time series
C. To compute forecasts for a time series using a chosen model
D. To plot a time series using the ggplot2 package
13. What is a stationary time series?
A. A time series where the mean, variance, and autocorrelation structure are constant over time
B. A time series where the observations are uncorrelated and have constant variance
C. A time series with a trend, but no seasonal or cyclic components
D. A time series with a seasonal pattern, but no trend or cyclic components

14. What R function can be used to detrend a time series?


A. diff()
B. ts.intersect()
C. decompose()
D. aggregate()

15. What is the purpose of the window() function in R?


A. To extract a subset of a time series based on a specified time range
B. To calculate the autocorrelation function (ACF) for a time series
C. To convert a data frame to a time series object
D. To compute forecasts for a time series using a chosen model

Answers for Self Assessment


1. A 2. B 3. B 4. C 5. A

6. D 7. B 8. A 9. C 10. B


11. A 12. A 13. A 14. C 15. A

Review Questions
1. What is a time series? How is it different from a cross-sectional data set?
2. What are some common patterns that can be observed in time series data?
3. What is autocorrelation? How can it be measured for a time series?
4. What is stationarity? Why is it important for time series analysis?
5. What is the difference between the additive and multiplicative decomposition of a time series?
6. What is a moving average model? How is it different from an autoregressive model?
7. What is the difference between white noise and a random walk time series?
8. How can seasonal patterns be modeled in a time series?
9. What is the purpose of the ARIMA model? How is it different from the ARMA model?
10. What is the purpose of the forecast package in R? What are some functions in this package that
can be used for time series analysis?
11. What is the purpose of cross-validation in time series analysis? How can it be implemented in
R?
12. What are some techniques for time series forecasting? How can they be implemented in R?

Further Reading
"Forecasting: Principles and Practice" by Rob J Hyndman and George Athanasopoulos -
This is an online textbook that covers the basics of time series forecasting using R. It
includes a lot of examples and code for different time series models, as well as practical
advice on how to apply them.
"Time Series Analysis and Its Applications: With R Examples" by Robert H. Shumway
and David S. Stoffer - This is a textbook that covers time series analysis and forecasting
using R. It covers a wide range of topics, from basic time series concepts to advanced
models and methods.
"Time Series Analysis in R" by A. Ian McLeod - This is a short book that covers the basics
of time series analysis in R. It includes examples of using R to analyze and model time
series data, as well as information on visualizing and interpreting time series plots.

Unit 05: Business Prediction Using Generalised Linear Models
Dr. Mohd Imran Khan, Lovely Professional University


CONTENTS
Objective
Introduction
5.1 Linear Regression
5.2 Generalised Linear Models
5.3 Logistic Regression
5.4 Generalised Linear Models Using R
5.5 Statistical Inferences of GLM
5.6 Survival Analysis
Keywords
Self Assessment
Answers for Self Assessment
Review Questions
Further Reading

Objective
After studying this unit, students will be able to

 learn about the theory behind GLMs, including the selection of appropriate link functions
and the interpretation of model coefficients.
 gain practical experience in data analysis by working with real-world datasets and using
statistical software to fit GLM models and make predictions.
 interpret the results of GLM analyses and communicate findings to others using clear
and concise language.
 think critically and solve complex problems, which can help develop important skills for
future academic and professional endeavors.

Introduction
Business prediction using generalized linear models (GLMs) is a common technique in data
analysis. GLMs extend the linear regression model to handle non-normal response variables by
using a link function to map the response variable to a linear predictor.
In business prediction, GLMs can be used to model the relationship between a response variable
and one or more predictor variables. The response variable could be a continuous variable, such as
sales or revenue, or a binary variable, such as whether a customer will make a purchase or not. The
predictor variables could be various business metrics, such as marketing spend, website traffic,
customer demographics, and more.
Generalized linear models (GLMs) can be used for business prediction in a variety of applications
such as marketing, finance, and operations. GLMs are a flexible statistical modeling framework that
can be used to analyze and make predictions about data that have non-normal distributions, such
as counts, proportions, and binary outcomes.


One common application of GLMs in business is to model customer behavior in marketing. For
example, a company might use GLMs to predict the likelihood of a customer responding to a
promotional offer, based on their demographic and behavioral data. This can help the company
optimize their marketing campaigns by targeting customers who are most likely to respond to their
offers.
GLMs can also be used in finance to predict the likelihood of default on a loan, based on the
borrower's credit history and other relevant variables. This can help banks and other financial
institutions make more informed lending decisions and manage their risk exposure.
In operations, GLMs can be used to predict the probability of defects or quality issues in a
manufacturing process, based on variables such as raw materials, production techniques, and
environmental factors. This can help companies optimize their production processes and reduce
waste and defects.
Overall, GLMs are a powerful tool for business prediction, providing a flexible and interpretable
framework for modeling a wide range of data types and outcomes.

5.1 Linear Regression


Linear regression is a statistical technique used to model the relationship between a dependent
variable and one or more independent variables. It is a popular and widely used method in various
fields such as economics, finance, engineering, social sciences, and many more.
In linear regression, the dependent variable is assumed to be a linear function of one or more
independent variables. The objective is to estimate the values of the coefficients in the linear
equation that best fit the observed data, so that we can use the equation to make predictions about
the dependent variable.
There are two types of linear regression: simple linear regression and multiple linear regression.
Simple linear regression involves only one independent variable, whereas multiple linear
regression involves two or more independent variables.
The most common approach to estimating the coefficients in linear regression is called the least
squares method. This involves minimizing the sum of the squared differences between the
observed values of the dependent variable and the predicted values based on the linear equation.
Linear regression has many applications in business, such as predicting sales based on advertising
expenditures, estimating the impact of price changes on demand, and modeling employee
productivity based on various factors such as education level, experience, and salary.
Overall, linear regression is a powerful tool for modeling and predicting relationships between
variables and is widely used in business and other fields.
Linear regression assumes that the relationship between the dependent variable and the
independent variable(s) is linear. This means that the change in the dependent variable is
proportional to the change in the independent variable(s). For example, if we are modeling the
relationship between salary and years of experience, we assume that the increase in salary is
proportional to the increase in years of experience.
Linear regression can also be used for prediction. After estimating the coefficients in the linear
equation, we can use the equation to predict the value of the dependent variable for new values of
the independent variable(s).
In addition to least squares, other methods can be used to estimate the coefficients in linear
regression. These include maximum likelihood estimation, Bayesian estimation, and gradient
descent.
Linear regression can be extended to handle nonlinear relationships between the dependent
variable and the independent variable(s) by adding polynomial terms or using other nonlinear
functions of the independent variable(s).
It's important to note that linear regression has assumptions that must be met for the results to be
valid. These include linearity, independence, homoscedasticity, and normality of errors. Violations
of these assumptions can lead to biased or unreliable results.


Overall, linear regression is a useful and widely used technique in data analysis and prediction,
with many applications in business, economics, social sciences, and other fields.
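As a minimal illustration of these ideas in R (the advertising and sales figures below are simulated purely for the example):
# Simulated data: monthly advertising spend and sales (both in thousands)
set.seed(1)
advertising <- runif(60, 10, 100)
sales <- 50 + 2.5 * advertising + rnorm(60, sd = 20)
# Fit a simple linear regression by least squares
sales_model <- lm(sales ~ advertising)
summary(sales_model)   # coefficients, standard errors, R-squared
# Predict sales for new advertising budgets
predict(sales_model, newdata = data.frame(advertising = c(40, 80)))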

5.2 Generalised Linear Models


Generalized Linear Models (GLMs) are a statistical framework that extends the linear regression
model to handle non-normally distributed dependent variables. GLMs allow for a wide range of
data distributions, such as binary, count, and continuous data, to be modeled with a link function
that relates the mean of the response variable to the linear predictor.
GLMs have three components: a probability distribution for the response variable, a linear
predictor that relates the response variable to the predictor variables, and a link function that links
the mean of the response variable to the linear predictor. The choice of probability distribution and
link function depends on the nature of the data being modeled.
Examples of GLMs include logistic regression for binary data, Poisson regression for count data,
and gamma regression for continuous data with positive values. GLMs can also handle
overdispersion and under dispersion, which occur when the variance of the response variable is
larger or smaller than predicted by the distribution.
GLMs can be used for prediction and inference, and provide interpretable coefficients that can be
used to make conclusions about the effects of predictor variables on the response variable. GLMs
also allow for the modeling of interactions between predictor variables and non-linear relationships
between predictor variables and the response variable.
GLMs have many applications in various fields, such as marketing, epidemiology, finance, and
environmental studies, where the response variable is not normally distributed. Overall, GLMs are
a flexible and powerful tool for modeling a wide range of non-normally distributed data.
GLMs are based on the idea that the mean of the response variable depends on a linear
combination of the predictor variables, but the variance of the response variable can be modeled by
a probability distribution other than the normal distribution. The link function used in GLMs
connects the linear combination of predictor variables with the mean of the response variable and
can be linear or non-linear.
The choice of link function depends on the type of response variable and the research question. For
example, the logistic link function is used for binary response variables, while the log link function
is used for count data. The identity link function is used for continuous data with a normal
distribution, which makes the GLM equivalent to the linear regression model.
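To make the choice of family and link function concrete, here is a hedged sketch of a Poisson regression with a log link on simulated count data (the variable names are purely illustrative):
# Simulated example: weekly purchase counts as a function of ad exposures
set.seed(2)
exposures <- rpois(200, lambda = 3)
purchases <- rpois(200, lambda = exp(0.2 + 0.3 * exposures))
# Poisson GLM with the (canonical) log link
count_model <- glm(purchases ~ exposures, family = poisson(link = "log"))
summary(count_model)
# exp() of a coefficient gives the multiplicative effect on the expected count
exp(coef(count_model)["exposures"])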
GLMs can be fit using maximum likelihood estimation, which involves finding the parameter
values that maximize the likelihood of the observed data given the model. The goodness of fit of a
GLM can be assessed using various methods, such as residual plots, deviance, and information
criteria.
GLMs can be extended to handle complex data structures, such as clustered or longitudinal data,
through the use of random effects or mixed-effects models. These models allow for the modeling of
within-cluster or within-subject correlation in the data.
GLMs have many applications in various fields, such as healthcare, social sciences, and ecology,
where the response variable is non-normally distributed. Examples of applications include
modeling disease prevalence, predicting student performance, and modeling species abundance.
Overall, GLMs are a flexible and powerful tool for modeling a wide range of non-normally
distributed data, allowing researchers to make predictions and draw inferences about the
relationships between predictor variables and the response variable.
GLMs can handle data that is not only non-normally distributed but also data that is not
continuous, such as binary or categorical data. In such cases, GLMs can use a link function to
transform the data to meet the assumptions of the model.
The link function in GLMs plays a critical role in transforming the response variable so that it can
be related to the linear predictor. Common link functions include the logit function for binary data,
the log function for count data, and the inverse function for continuous data with positive values.
GLMs can also account for the effect of covariates or predictor variables on the response variable.
This allows for the modeling of the relationship between the response variable and multiple


predictor variables. The coefficients of the GLM can be used to determine the direction and
magnitude of the effects of the predictor variables on the response variable.
In addition to logistic regression and Poisson regression, which are commonly used GLMs, other
types of GLMs include negative binomial regression, which can handle overdispersion in count
data, and ordinal regression, which can handle ordinal data.
GLMs require some assumptions, such as linearity between the predictor variables and the
response variable and independence of observations. Violations of these assumptions can lead to
biased or unreliable results.
Overall, GLMs are a useful and versatile statistical framework for modeling non-normally
distributed data, and can be applied in various fields. GLMs allow for the modeling of multiple
predictor variables and can be used for prediction and inference.
Suppose we want to model the relationship between a binary response variable (e.g., whether a
customer made a purchase or not) and several predictor variables (e.g., age, gender, income). We
can use logistic regression, a type of GLM, to model this relationship.
Data Preparation: First, we need to prepare our data. We will use a dataset containing information
on customers, including their age, gender, income, and whether they made a purchase or not. We
will split the data into training and testing sets, with the training set used to fit the model and the
testing set used to evaluate the model's performance.
Model Specification: Next, we need to specify the model. We will use logistic regression as our
GLM, with the binary response variable (purchase or not) modeled as a function of the predictor
variables (age, gender, income).
Model Fitting: We can fit the model to the training data using maximum likelihood estimation. The
coefficients of the logistic regression model can be interpreted as the log-odds of making a
purchase, given the values of the predictor variables.
Model Evaluation: We can evaluate the performance of the model using the testing data. We can
calculate metrics such as accuracy, precision, and recall to measure how well the model is able to
predict the outcome variable based on the predictor variables.
Model Improvement: If the model performance is not satisfactory, we can consider improving the
model by adding or removing predictor variables or transforming the data using different link
functions.
Overall, building a GLM model involves data preparation, model specification, model fitting,
model evaluation, and model improvement. By following these steps, we can build a model that
accurately captures the relationship between the response variable and the predictor variables and
can be used to make predictions or draw inferences.

5.3 Logistic Regression


Logistic regression is a type of generalized linear model (GLM) used to model the probability of a
binary response variable, typically coded as 0 or 1. It is commonly used in various fields, such as
medicine, finance, and social sciences, to predict the likelihood of an event or outcome based on one
or more predictor variables.
Logistic regression models the relationship between the log odds of the binary response variable
and the predictor variables using a sigmoidal or "S-shaped" curve. The model estimates the
coefficients of the predictor variables, which represent the change in log odds of the response
variable for a one-unit increase in the predictor variable, holding all other predictor variables
constant.
Logistic regression assumes that the relationship between the log odds of the response variable and the predictor variables is linear, and that the observations are independent of each other. Unlike linear regression, it does not require the residuals to be normally distributed or to have constant variance.
Logistic regression can be used for both prediction and inference. In the context of prediction,
logistic regression can be used to estimate the probability of the response variable given the values
of the predictor variables. In the context of inference, logistic regression can be used to test


hypotheses about the effect of the predictor variables on the response variable, such as whether the
effect is significant or not.
Overall, logistic regression is a useful and widely-used statistical technique for modeling binary
response variables and can be applied in various fields. It allows for the modeling of the
relationship between the response variable and multiple predictor variables, and provides
interpretable coefficients that can be used to draw conclusions about the effects of the predictor
variables on the response variable.
Example
Here is an example of using logistic regression to model a binary response variable:
Suppose we want to model the probability of a customer making a purchase based on their age and
gender. We have a dataset containing information on several customers, including their age,
gender, and whether they made a purchase or not.
Data Preparation: First, we need to prepare our data. We will split the data into training and testing
sets, with the training set used to fit the model and the testing set used to evaluate the model's
performance. We will also preprocess the data by encoding the gender variable as a binary
indicator variable (e.g., 1 for female and 0 for male).
Model Specification: Next, we need to specify the logistic regression model. We will model the
probability of making a purchase as a function of age and gender. The logistic regression model
takes the form:
log(p / (1 − p)) = β0 + β1(age) + β2(gender)
where p is the probability of making a purchase, β0 is the intercept term, β1 is the coefficient for
age, and β2 is the coefficient for gender.
Model Fitting: We can fit the logistic regression model to the training data using maximum
likelihood estimation. The coefficients of the model can be interpreted as the change in log odds of
making a purchase for a one-unit increase in age or a change in gender from male to female.
Model Evaluation: We can evaluate the performance of the logistic regression model using the
testing data. We can calculate metrics such as accuracy, precision, and recall to measure how well
the model is able to predict the outcome variable based on the predictor variables.
Model Improvement: If the model performance is not satisfactory, we can consider improving the
model by adding or removing predictor variables or transforming the data using different link
functions.
Overall, logistic regression is a useful technique for modeling binary response variables and can be
used in various fields. In this example, we used logistic regression to model the probability of a
customer making a purchase based on their age and gender. By following the steps of data
preparation, model specification, model fitting, model evaluation, and model improvement, we can
build a model that accurately captures the relationship between the response variable and the
predictor variables and can be used for prediction or inference.

5.4 Generalised Linear Models Using R


In R, GLMs can be easily implemented using the glm() function. The glm() function takes several
arguments, including the response variable, predictor variables, and the chosen link function. The
summary() function can be used to obtain a summary of the GLM model, including coefficients,
standard errors, and p-values. Additionally, the predict() function can be used to make predictions
on new data using the fitted GLM model.
GLMs are a powerful tool for business prediction, as they can handle a wide range of response
variables and predictor variables, and can be easily implemented in R. However, as with any
statistical model, it is important to carefully evaluate the assumptions and limitations of the model,
and to consider alternative approaches if necessary.
Example 1
Suppose we have a dataset containing information on the age and income of individuals, as well as
whether or not they own a car (a binary variable). We want to model the relationship between car
ownership and age and income.


Data Preparation: First, we need to import our data into R. Let's assume our data is in a CSV file
named "car_ownership.csv". We can use the read.csv function to import the data:
car_data <- read.csv("car_ownership.csv")
Model Specification: Next, we need to specify the logistic regression model. We will use the glm
function in R to fit the model:
car_model <- glm(own_car ~ age + income, data = car_data, family = "binomial")
In this code, "own_car" is the response variable (binary variable indicating whether or not the
individual owns a car), and "age" and "income" are the predictor variables. The family argument
specifies the type of GLM to be fitted, in this case a binomial family for binary data.
Model Fitting: The glm() call above fits the model to our data; we can then inspect the fitted model using the summary function in R:
summary(car_model)
This will output a summary of the model, including the estimated coefficients and their standard errors, as well as goodness-of-fit measures such as the deviance and AIC.
Model Evaluation: We can evaluate the performance of our model by calculating the predicted
probabilities of car ownership for each individual in our dataset:
car_prob <- predict(car_model, type = "response")
This will output a vector of predicted probabilities, one for each individual in the dataset. We can
then compare these probabilities to the actual car ownership status to evaluate the accuracy of the
model.
Model Improvement: If the model performance is not satisfactory, we can consider improving the
model by adding or removing predictor variables, transforming the data, or using a different link
function.
Overall, performing logistic regression in R involves importing the data, specifying the model using
the glm function, fitting the model using the summary function, evaluating the model by
calculating predicted probabilities, and improving the model if necessary.
Example2
Here's an example of how to perform logistic regression in R using the built-in "mtcars" dataset:
Data Preparation: We can load the "mtcars" dataset. The response variable "am" (which indicates whether a car has an automatic or manual transmission) is already coded as a binary indicator variable (0 for automatic and 1 for manual), so no recoding is needed:
data(mtcars)
Model Specification: We can specify the logistic regression model using the "glm()" function:
model <- glm(am ~ hp + wt, data = mtcars, family = binomial)
In this model, we are predicting the probability of a car having a manual transmission based on its
horsepower ("hp") and weight ("wt").
Model Fitting: The glm() call above fits the logistic regression model; we can use the "summary()" function to view the estimated coefficients and their significance:
summary(model)
This will output the estimated coefficients for the intercept, "hp", and "wt", as well as their standard
errors, z-values, and p-values.
Model Evaluation: We can evaluate the performance of the logistic regression model using the
"predict()" function to obtain predicted probabilities for the testing data, and then calculate metrics
such as accuracy, precision, and recall to measure how well the model is able to predict the
outcome variable based on the predictor variables.
probs <- predict(model, newdata = mtcars, type = "response")
preds <- ifelse(probs > 0.5, 1, 0)


accuracy <- mean(preds == mtcars$am)


In this example, we are using the entire dataset for both training and testing purposes. However,
it's important to note that this is not best practice, and we should split the data into training and
testing sets to avoid overfitting.
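A hedged sketch of such a split (the 70/30 proportion, seed, and object names are arbitrary illustrative choices):
# Split mtcars into training (70%) and testing (30%) sets
set.seed(123)
train_idx <- sample(seq_len(nrow(mtcars)), size = round(0.7 * nrow(mtcars)))
train <- mtcars[train_idx, ]
test <- mtcars[-train_idx, ]
# Refit on the training set and evaluate on the held-out test set
model_tt <- glm(am ~ hp + wt, data = train, family = binomial)
test_probs <- predict(model_tt, newdata = test, type = "response")
test_preds <- ifelse(test_probs > 0.5, 1, 0)
mean(test_preds == test$am)   # test-set accuracy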
Overall, logistic regression is a useful technique for modeling binary response variables in R, and
the built-in "glm()" function makes it easy to specify and fit models. By following the steps of data
preparation, model specification, model fitting, model evaluation, and model improvement, we can
build a model that accurately captures the relationship between the response variable and the
predictor variables and can be used for prediction or inference.

5.5 Statistical Inferences of GLM


Suppose we have a dataset containing information on the age and income of individuals, as well as
whether or not they own a car (a binary variable). We want to model the relationship between car
ownership and age and income using a logistic regression model.
Model Specification and Fitting: First, we need to specify and fit the logistic regression model
using the glm function in R:
car_model <- glm(own_car ~ age + income, data = car_data, family = "binomial")
In this code, "own_car" is the response variable (binary variable indicating whether or not the
individual owns a car), and "age" and "income" are the predictor variables. The family argument
specifies the type of GLM to be fitted, in this case a binomial family for binary data.
Hypothesis Testing: We can test the significance of the predictor variables using the Wald z-tests that the fitted GLM provides for each coefficient. For example, to examine the test for "age", we can extract its row from the coefficient table:
coef(summary(car_model))["age", ]
This returns the estimated coefficient, its standard error, the z value, and the two-sided p-value for the null hypothesis that the coefficient for "age" is zero. A p-value below the chosen significance level (for example 0.05) indicates that age has a statistically significant effect on car ownership.
Confidence Intervals: We can calculate confidence intervals for the model parameters using the
confint function in R:
confint(car_model, level = 0.95)
This code will output a table of confidence intervals for each of the model parameters with a 95%
confidence level.
Goodness-of-Fit Tests: We can perform goodness-of-fit tests to assess how well the model fits the
data. For example, to perform a deviance goodness-of-fit test, we can use the following code:
pchisq(deviance(car_model), df = df.residual(car_model), lower.tail = FALSE)
This code calculates the p-value for the deviance goodness-of-fit test using the chi-square
distribution. If the p-value is less than the significance level (e.g., 0.05), we can reject the null
hypothesis that the model fits the data well.
Residual Analysis: We can analyze the residuals to assess the appropriateness of the model. For
example, we can plot the residuals against the predicted values using the plot function in R:
plot(car_model, which = 1)
This code will plot the residuals against the predicted values. We can examine the plot to look for
any patterns or trends in the residuals that might indicate that the model is not capturing all the
important features of the data.
Overall, performing statistical inferences on a GLM involves testing the significance of predictor
variables, calculating confidence intervals for model parameters, performing goodness-of-fit tests,
and analyzing residuals. These inferences help us assess the validity of the model and draw valid
conclusions from the model.


5.6 Survival Analysis


Survival refers to the ability of an organism to continue living in a given environment or situation.
It is the state of being able to withstand adverse conditions and remain alive.
Survival can refer to many different contexts, including:
Physical survival: This refers to the ability to maintain basic bodily functions such as breathing,
circulating blood, and regulating body temperature.
Emotional survival: This refers to the ability to cope with difficult or traumatic experiences, such as
abuse, trauma, or loss.
Financial survival: This refers to the ability to manage one's finances and maintain a stable income.
Survival in nature: This refers to the ability of an animal or plant to adapt to its environment and
avoid being killed by predators or natural disasters.
In general, survival requires a combination of factors, including physical strength, mental fortitude,
resilience, and adaptability. It is an essential aspect of life, and individuals and species that are
better able to survive are more likely to thrive and pass on their genes to future generations.

Survival Analysis Using R


Survival analysis in the R programming language deals with modelling the time until an event of interest occurs. If the event does not occur within the observation period, the result is a censored observation, i.e. an incomplete observation.
The biological sciences are among the most important application areas of survival analysis, where we can predict, for example, how long organisms survive or how long they take to reach a certain size.
Methods used to do survival analysis:
There are two methods that can be used to perform survival analysis in R programming language:

 Kaplan-Meier method
 Cox Proportional hazard model

Kaplan-Meier Method
The Kaplan-Meier method estimates the survival distribution using the Kaplan-Meier estimator for truncated or censored data. It is a non-parametric statistic that allows us to estimate the survival function and is therefore not based on an assumed underlying probability distribution. The Kaplan–Meier estimates are based on the number of patients (each patient is a row of data), out of the total number of patients, who survive for a certain time after treatment (the event being death).
We represent the Kaplan–Meier estimator by the formula:
S(t) = ∏ (1 − d_i / n_i), where the product is taken over all event times t_i ≤ t.
Here S(t) represents the probability that survival time is longer than t, t_i denotes the times at which at least one event happened, d_i represents the number of events (e.g., deaths) that happened at time t_i, and n_i represents the number of individuals at risk (still under observation) just before time t_i.
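
To make the formula concrete, here is a minimal hedged sketch in R with a hypothetical toy dataset (five subjects, not the lung data) that computes the Kaplan–Meier estimate with survfit() and checks it against the product formula by hand:
# Toy data (hypothetical): follow-up times and status (1 = event, 0 = censored)
library(survival)
time <- c(2, 3, 3, 5, 7)
status <- c(1, 1, 0, 1, 0)
# Kaplan-Meier estimate from survfit()
km <- survfit(Surv(time, status) ~ 1)
summary(km)
# Hand calculation from S(t) = product of (1 - d_i / n_i):
# at t = 2: S(2) = 1 - 1/5 = 0.8
# at t = 3: S(3) = 0.8 * (1 - 1/4) = 0.6, matching the summary(km) output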

Example:
We will use the survival package for the analysis, along with the lung dataset preloaded in the
package, which contains data on 228 patients with advanced lung cancer from the North Central
Cancer Treatment Group, described by 10 features. The dataset contains missing values, so
missing-value treatment is presumed to have been done on your side before building the model.


# Installing package
install.packages("survival")
# Loading package
library(survival)
# Dataset information
?lung
# Fitting the survival model (status == 2 indicates death; status == 1 is censored)
Survival_Function <- survfit(Surv(lung$time, lung$status == 2) ~ 1)
Survival_Function
# Plotting the function
plot(Survival_Function)
Here, we are interested in “time” and “status” as they play an important role in the analysis. Time
represents the survival time of the patients, and status records whether each patient is dead or
censored (still alive at the last follow-up).
The Surv() function takes the time and status variables as input and creates a survival object, which
serves as the input to the survfit() function. We pass ~ 1 to survfit() to tell the function to fit an
intercept-only model, i.e., a single survival curve based on the survival object alone.
survfit() creates survival curves and prints the number of observations, the number of events
(deaths), the median survival time, and its 95% confidence interval. The plot gives the following
output:

Here, the x-axis specifies the “Number of days” and the y-axis specifies the “probability of survival”.
The dashed lines are the upper and lower confidence limits.
The confidence interval shows the expected margin of error; for example, at around 200 days of
survival the upper confidence limit is roughly 0.76 (76%) while the lower limit drops to about 0.60
(60%).
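
If the exact estimates and confidence limits at particular time points are needed (rather than reading them off the plot), the summary() method for survfit objects accepts a times argument. A minimal sketch, assuming Survival_Function from the code above is still in the workspace:
# Survival probability and 95% confidence limits at selected days
summary(Survival_Function, times = c(100, 200, 300))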

Cox proportional hazard model


It is a regression model that measures the instantaneous risk of death and is a bit more difficult to
illustrate than the Kaplan-Meier estimator. It is built around the hazard function h(t), which
describes the probability of the event (e.g., death) occurring at a particular time t, given survival up
to that time. The hazard function incorporates covariates (the independent variables in the
regression) to compare the survival of patient groups.
It does not assume an underlying probability distribution, but it does assume that the ratio of the
hazards of the patient groups being compared is constant over time; because of this it is known as
the “proportional hazards model”.

Example:
As before, we will use the survival package and the lung dataset, which contains data on 228
patients with advanced lung cancer from the North Central Cancer Treatment Group, described by
10 features. The dataset contains missing values, so missing-value treatment is presumed to have
been done on your side before building the model. We will use the Cox proportional hazards
function coxph() to build the model.


# Installing package
install.packages("survival")
# Loading package
library(survival)
# Dataset information
?lung
# Fitting the Cox model on all covariates (excluding the outcome variables time and status)
Cox_mod <- coxph(Surv(time, status == 2) ~ . - time - status, data = lung)
# Summarizing the model
summary(Cox_mod)
# Fitting survfit() to obtain the survival curve implied by the Cox model
Cox <- survfit(Cox_mod)
Here, again, “time” and “status” play the key role in the analysis: time represents the survival time
of the patients, and status records whether each patient died or was censored.
The Surv() function takes the time and status variables as input and creates a survival object, which
is passed to coxph() together with the covariates on the right-hand side of the formula. survfit() is
then applied to the fitted Cox model to obtain the adjusted survival curve.
# Plotting the function
plot(Cox)
The Cox_mod output is similar to the output of a regression model. Some of the important features
are age, sex, ph.ecog and wt.loss. The plot gives the following output:

Here, the x-axis specifies the “Number of days” and the y-axis specifies the “probability of survival”.
The dashed lines are the upper and lower confidence limits. In comparison with the Kaplan-Meier
plot, the Cox curve is higher for early time points and lower for later time points because the Cox
model adjusts for additional variables.
The confidence interval again shows the expected margin of error; for example, at around 200 days
of survival the upper confidence limit reaches about 0.82 (82%) while the lower limit drops to about
0.70 (70%).
Note: The Cox model often gives more informative results than the Kaplan-Meier estimator because
it makes use of the covariates in the data. Its curve is also higher at early times and drops more
sharply as time increases.
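
Because the Cox model relies on the proportional-hazards assumption described above, it is good practice to check it. A minimal sketch using cox.zph() from the survival package on the Cox_mod object fitted above (a small p-value for a covariate suggests its effect is not constant over time):
# Test the proportional-hazards assumption for each covariate
ph_check <- cox.zph(Cox_mod)
print(ph_check)
# Plot scaled Schoenfeld residuals against time for a visual check
plot(ph_check)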

Detailed Example of Survival Analysis


Step 1: Load the necessary packages using the library() function.
The survival package is the core package for survival analysis in R, and it provides many functions
for fitting and visualizing survival models. The survminer package provides additional
visualization functions that are useful for interpreting survival analysis results.
library(survival) # Load the survival package

library(survminer) # Load the survminer package


Step 2: Import your dataset into R.
The read.csv() function can be used to import a dataset in CSV format into R. In this example, we
will use the lung dataset, which is included in the survival package:
data(lung) # Load the lung dataset
head(lung) # View the first few rows of the dataset
Step 3: Create a survival object using the Surv() function.
The Surv() function is used to create a survival object from the time-to-event variable and the event
indicator variable in your dataset. In this example, we will use the time and status variables from
the lung dataset:
surv.obj <- Surv(time = lung$time, event = lung$status)
Step 4: Fit a survival model to your data using a function such as survfit(), coxph(), or phreg().
The survfit() function is used to fit a Kaplan-Meier survival curve to the data. In this example, we
will fit separate survival curves for the sex variable in the lung dataset:
fit <- survfit(surv.obj ~ sex, data = lung)
The coxph() function is used to fit a Cox proportional hazards regression model to the data. In this
example, we will fit a Cox model to the age and sex variables in the lung dataset:
cox.model<- coxph(surv.obj ~ age + sex, data = lung)
Step 5: Visualize the results using functions such as ggsurvplot().
The ggsurvplot() function is used to create a graphical representation of the Kaplan-Meier survival
curve. In this example, we will create a plot of the survival curves for the sex variable in the lung
dataset:
ggsurvplot(fit, data = lung, risk.table = TRUE)
The cumulative hazard can also be plotted by passing the fitted Cox model to survfit() and then to
ggsurvplot() with fun = "cumhaz". In this example, we will create a plot of the cumulative hazard
function for the Cox model:
ggsurvplot(survfit(cox.model), data = lung, fun = "cumhaz")
Step 6: Perform further analyses, such as Cox proportional hazards regression or competing risks
analysis, using the appropriate functions in R.
The coxph() function can be used to fit a Cox proportional hazards regression model to the data. In
this example, we will fit a Cox model to the age and sex variables in the lung dataset:
cox.model<- coxph(surv.obj ~ age + sex, data = lung)
The coxph() function also allows for the inclusion of interaction terms and time-dependent
covariates in the model.
The cmprsk package can be used to conduct competing risks analysis in R. Competing risks occur
when an individual may experience multiple possible outcomes, and the occurrence of one
outcome may preclude the occurrence of other outcomes.
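
As an illustration, here is a minimal hedged sketch of a competing-risks analysis with cmprsk::cuminc(); the data frame used here is hypothetical (ftime is follow-up time, fstatus is coded 0 = censored, 1 = event of interest, 2 = competing event):
# Installing and loading the package
install.packages("cmprsk")
library(cmprsk)
# Hypothetical example data
df <- data.frame(
  ftime = c(5, 8, 12, 3, 9, 15, 7, 11),
  fstatus = c(1, 0, 2, 1, 1, 0, 2, 1),
  group = c("A", "A", "A", "A", "B", "B", "B", "B")
)
# Cumulative incidence functions for each event type, estimated by group
ci <- cuminc(ftime = df$ftime, fstatus = df$fstatus, group = df$group)
ci
plot(ci)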

Keywords
Response variable: The variable of interest that is being modeled and predicted by a GLM. It can
be continuous, binary, count, or ordinal.
Predictor variable: The variable(s) used to explain the variation in the response variable. They can
be continuous, binary, or categorical.
Link function: A function used to relate the mean of the response variable to the linear predictor. It
can be used to transform the response variable to a different scale or to model the relationship
between the predictor and response variables.


Exponential family: A class of probability distributions that includes the Poisson, binomial,
gamma, and normal distributions, among others. GLMs are based on the exponential family of
distributions.
Maximum likelihood estimation: A method used to estimate the parameters of a GLM by finding
the values that maximize the likelihood of the observed data.
Goodness of fit: A measure of how well a GLM model fits the observed data. It can be evaluated
using deviance, residual analysis, and other methods.
Residual analysis: A method used to check the assumptions of a GLM model and identify potential
problems such as outliers and influential observations.
Model selection: A process of comparing different GLM models and selecting the best one based
on their fit and complexity, using AIC/BIC, likelihood ratio tests, and other methods.

Self Assessment
1. What package is the core package for survival analysis in R?
A. ggplot2
B. dplyr
C. survival
D. tidyr

2. Which function is used to create a survival object in R?


A. Surv()
B. plot()
C. summary()
D. lm()

3. Which function is used to fit a Kaplan-Meier survival curve to the data?


A. coxph()
B. phreg()
C. survfit()
D. flexsurv()

4. Which package provides additional visualization functions for interpreting survival analysis
results?
A. ggplot2
B. dplyr
C. survival
D. survminer

5. Which function is used to fit a Cox proportional hazards regression model to the data?
A. coxph()
B. phreg()
C. survfit()
D. flexsurv()

6. Which of the following is a characteristic of Generalized Linear Models (GLMs)?


A. The response variable must be continuous


B. GLMs cannot be used for non-linear relationships between the predictor and response
variables
C. The variance of the response variable can be non-constant
D. GLMs assume a linear relationship between the predictor and response variables

7. Which of the following families is used for binary data in GLMs?


A. Gaussian
B. Poisson
C. Binomial
D. Exponential

8. Which of the following is a method for selecting the best predictors in a GLM?
A. Principle Component Analysis (PCA)
B. Multiple Linear Regression
C. Stepwise Regression
D. Analysis of Variance (ANOVA)

9. Which of the following is a technique used to check the assumptions of a GLM?


A. Principal Component Analysis (PCA)
B. Residual Analysis
C. Discriminant Analysis
D. Analysis of Variance (ANOVA)
10. Which of the following is not a type of GLM?
A. Linear Regression
B. Logistic Regression
C. Poisson Regression
D. Negative Binomial Regression

11. Which R package is commonly used for fitting GLMs?


A. dplyr
B. ggplot2
C. caret
D. glm

12. What is the function used to fit a Poisson regression model in R?


A. glm()
B. lm()
C. gam()
D. lme()

13. What does the family argument in the glm() function specify?
A. The type of distribution for the response variable
B. The type of distribution for the predictor variable
C. The type of link function to use
D. The type of loss function to use


14. How can you check the goodness of fit of a GLM model in R?
A. Using the summary() function
B. Using the anova() function
C. Using the plot() function
D. Using the residuals() function

15. Which R package is commonly used for visualizing GLM results?


A. ggplot2
B. caret
C. lme4
D. MASS

Answers for Self Assessment

1. C 2. A 3. C 4. D 5. A

6. C 7. C 8. C 9. B 10. A

11. D 12. A 13. A 14. C 15. A

Review Questions
1. A hospital wants to determine the factors that affect the length of stay for patients. What
type of GLM would be appropriate for this analysis?
2. A manufacturing company is interested in modeling the number of defective items
produced per day. What type of GLM would be appropriate for this analysis?
3. A bank is interested in predicting the probability of default for a loan applicant. What type
of GLM would be appropriate for this analysis?
4. A marketing company wants to model the number of clicks on an online advertisement.
What type of GLM would be appropriate for this analysis?
5. A sports team is interested in predicting the probability of winning a game based on the
number of goals scored. What type of GLM would be appropriate for this analysis?
6. A social scientist wants to model the number of criminal incidents per month in a city.
What type of GLM would be appropriate for this analysis?
7. What is survival analysis and what types of data is it typically used for?
8. What is a Kaplan-Meier survival curve, and how can it be used to visualize survival data?
9. What is the Cox proportional hazards regression model, and what types of data is it
appropriate for analyzing?
10. What is a hazard ratio, and how is it calculated in the context of the Cox proportional
hazards model?
11. How can the results of a Cox proportional hazards regression model be interpreted, and
what types of conclusions can be drawn from the analysis?
12. How can competing risks analysis be conducted in R, and what types of outcomes is it
appropriate for analyzing?


13. What are some common visualizations used in survival analysis, and how can they be
created using R functions?
14. What are some potential sources of bias in survival analysis, and how can they be
addressed or minimized?

Further Reading
 The GLM section of UCLA's Institute for Digital Research and Education website
provides a detailed introduction to GLMs, along with examples and tutorials using
different statistical software packages such as R and Stata:
https://stats.idre.ucla.edu/r/dae/generalized-linear-models/
 The CRAN Task View on “Distributions and Their Inference” provides a
comprehensive list of R packages related to GLMs, along with their descriptions
and links to documentation: https://cran.r-project.org/web/views/Distributions.html
 "Generalized Linear Models" by P. McCullagh and J. A. Nelder: This is a classic
book on GLMs and provides a thorough treatment of the theory and applications of
GLMs.
 "An Introduction to Generalized Linear Models" by A. Dobson: This book is a
concise introduction to GLMs and covers the key concepts and methods in a clear
and accessible way.


Dr. Mohd Imran Khan, Lovely Professional University

Unit 06: Machine Learning for Businesses


CONTENTS
Objective
Introduction
6.1 Machine Learning
6.2 Use cases of Machine Learning in Businesses
6.3 Supervised Learning
6.4 Steps in Supervised Learning
6.5 Supervised Learning Using R
6.6 Supervised Learning using KNN
6.7 Supervised Learning using Decision Tree
6.8 Unsupervised Learning
6.9 Steps in Un-Supervised Learning
6.10 Unsupervised Learning Using R
6.11 Unsupervised learning using K-means
6.12 Unsupervised Learning using Hierarchical Clustering
6.13 Classification and Prediction Accuracy in Unsupervised Learning
Summary
Keywords
Self Assessment
Answers for Self Assessment
Review Question
Further readings

Objective
After studying this unit, students will be able to:

 develop and apply machine learning algorithms and models.


 increase earning potential and have a higher chance of securing a well-paying job.
 analyze large amounts of data and develop algorithms to extract meaningful insights.
 solve complex problems and develop solutions.
 gain proficiency in data handling and preprocessing techniques.

Introduction
Machine learning is a field of artificial intelligence that has been rapidly growing in recent years,
and has already had a significant impact on many industries. At its core, machine learning involves
the development of algorithms and models that can learn patterns in data, and then use those
patterns to make predictions or decisions about new data. There are several different types of
machine learning, including supervised learning, unsupervised learning, and reinforcement
learning. Each of these types of machine learning has its own strengths and weaknesses, and is
suited to different types of problems.


One of the most important applications of machine learning is in the field of natural language
processing (NLP). NLP involves using machine learning to analyze and understand human
language, and is used in applications such as chatbots, voice assistants, and sentiment analysis.
NLP is also important in fields such as healthcare, where it can be used to extract useful
information from patient records and other medical documents.
Another important application of machine learning is in computer vision. This involves using
machine learning to analyze and interpret visual data and is used in applications such as image and
video recognition, facial recognition, and object detection. Computer vision is important in fields
such as self-driving cars, where it is used to help vehicles navigate and avoid obstacles.
Predictive modeling is another important application of machine learning. This involves using
machine learning to make predictions based on data, and is used in applications such as fraud
detection, stock market prediction, and customer churn prediction. Predictive modeling is
important in fields such as marketing, where it can be used to identify customers who are likely to
leave a company and take steps to retain them.
The potential for machine learning is enormous, and its applications are likely to continue to
expand in the coming years. One area where machine learning is likely to have a significant impact
is in healthcare. Machine learning can be used to analyze patient data and identify patterns that
could be used to diagnose and treat a wide range of diseases. Machine learning can also be used to
identify patients who are at high risk of developing certain conditions and take steps to prevent
those conditions from occurring.
Another area where machine learning is likely to have a significant impact is in education. Machine
learning can be used to analyze student data and identify patterns that could be used to improve
learning outcomes. Machine learning can also be used to personalize learning for individual
students, by adapting the pace and style of instruction to their individual needs.
In conclusion, machine learning is a rapidly growing field with many exciting applications. Its
ability to learn from data and make predictions or decisions based on that data has already had a
significant impact on many industries, and its potential for the future is enormous. As more data
becomes available and more powerful computing resources become available, machine learning is
likely to continue to grow in importance and have a significant impact on many aspects of our lives.

6.1 Machine Learning


With the explosive growth of data science, more and more businesses are focusing on improving
their business processes by harnessing the power of technologies like
Big data
Machine learning (ML)
Artificial intelligence
Among them, machine learning is a technology that helps businesses effectively gain insights from
raw data. Machine learning—specifically machine learning algorithms—can be used to iteratively
learn from a given data set, understand patterns, behaviors, etc., all with little to no programming.
This iterative and constantly evolving nature of the machine learning process helps businesses
ensure that they are always up to date with business and consumer needs. Plus, it is easier than ever
to build or integrate ML into existing business processes, since all the major cloud providers offer
ML platforms.
So, in this unit, we will dive into how machine learning benefits businesses of all shapes and
sizes.
Before we look at ML benefits, we need to have a basic understanding of how ML works. Machine
learning refers to the process of extracting meaningful information and patterns from raw data sets.
For example, let’s consider an online retail store that captures the user behavior and purchases
within the website. This is merely data. But machine learning plays a significant role, enabling the
online store to analyze and extract the patterns, stats, information, and stories hidden within this
data.
A key factor that differentiates machine learning from regular analytical algorithms is its
adaptability. Machine learning algorithms are constantly evolving. The more data the ML
algorithm consumes, the more accurate their analytics and predictions will be.


Harnessing the power of machine learning has enabled businesses to:


More easily adapt to ever-changing market conditions
Improve business operations
Gain a greater understanding of the overall business and consumer needs
Machine learning is quickly becoming ubiquitous across all industries from agriculture to medical
research, stock market, traffic monitoring, etc. For instance, machine learning can be utilized in
agriculture for various tasks such as predicting weather patterns and crop rotation.
Machine learning can be combined with artificial intelligence to enhance the analytical process
gaining further benefits to businesses. Services like Azure Machine Learning and Amazon
SageMaker enable users to utilize the power of cloud computing to integrate ML to suit any
business need.

Common machine learning algorithms


A number of machine learning algorithms are commonly used. These include:
Neural networks: Neural networks simulate the way the human brain works, with a huge
number of linked processing nodes. Neural networks are good at recognizing patterns and play an
important role in applications including natural language translation, image recognition, speech
recognition, and image creation.
Linear regression: This algorithm is used to predict numerical values, based on a linear
relationship between different values. For example, the technique could be used to predict house
prices based on historical data for the area.
Logistic regression: This supervised learning algorithm makes predictions for categorical
response variables, such as “yes/no” answers to questions. It can be used for applications such as
classifying spam and quality control on a production line (see the short sketch after this list).
Clustering: Using unsupervised learning, clustering algorithms can identify patterns in data so
that it can be grouped. Computers can help data scientists by identifying differences between data
items that humans have overlooked.
Decision trees: Decision trees can be used for both predicting numerical values (regression) and
classifying data into categories. Decision trees use a branching sequence of linked decisions that can
be represented with a tree diagram. One of the advantages of decision trees is that they are easy to
validate and audit, unlike the black box of the neural network.
Random forests: In a random forest, the machine learning algorithm predicts a value or category
by combining the results from a number of decision trees.
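
To make the regression algorithms above concrete, here is a minimal hedged sketch using R's built-in mtcars dataset (the variable names below come from that dataset, not from the text):
# Linear regression: predict a numeric value (miles per gallon) from car weight
lin_mod <- lm(mpg ~ wt, data = mtcars)
summary(lin_mod)
# Logistic regression: predict a binary outcome (manual vs automatic transmission)
log_mod <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(log_mod)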

6.2 Use cases of Machine Learning in Businesses


Below are some of the most apparent benefits of machine learning for businesses:

Optimizing marketing campaigns and detecting spam


Customer segmentation and content personalization make it possible to optimize marketing
campaigns. How? Machine learning gives businesses insights to improve ad targeting and
marketing management.
Spam detection is another excellent application of machine learning, with these solutions having
been used for a long time. Before ML and deep learning, email service companies set specific
criteria for classifying a message as spam. These days, the filters automatically generate new rules
based on neural networks – faster than ever before.

Individualization & predictability


We’ve all seen it before. The day begins with you visiting Amazon, reading the product description,
and picking out an iPad. A day later, you see an ad on Facebook for the same iPad model. You start
noticing it everywhere. Spooky, right?


That is what AI and machine learning is doing for you. These technologies let advertisers modify
the way they market. AI is changing the e-commerce landscape in a significant way, giving
marketers the advantage of tailoring their marketing strategies while also saving businesses a lot of
money.
The retail industry has reduced overstock, improved shipping times, and cut returns substantially
as a result of artificial intelligence. Current trends suggest that machines will be able to
supplement your staff’s weak spots in the future without having to resort to mass firings.
The increasing use of artificial intelligence will likely continue to affect the advertising industry.
With machine learning, marketers will get a deeper understanding of their customers’ minds and
hearts and will easily create communications layouts tailored to each customer.

Recruiting & HR process improvement


Machine learning and artificial intelligence will almost certainly dominate recruitment as well. AI
technologies have advanced substantially since their introduction. As a result, it reduces repetitive
tasks, speeding up lots of processes.
Meanwhile, AI-enabled monitoring systems and HRMs are available, which enable businesses to
develop job search engines, identify the most qualified individuals, browse resumes effectively, and
conduct interviews without forcing candidates to come into the office.

Predicting the customer’s lifetime value


Today’s businesses have access to massive volumes of data that can be used to generate valuable
business insights. Customer information makes up a substantial amount of company data.
Analyzing it may allow you to learn more about customers, including their purchasing habits,
demands, and requirements. A customer lifetime value estimate is a valuable tool to provide
personalized offers to your customers.

Automates data entry


Duplicated and erroneous data are two of the most severe issues today’s organizations face.
Manual data entry errors can be drastically reduced using predictive modeling methods and
machine learning. As a result, employees can spend their time on tasks that bring more value to the
company.

Financial analysis
Using machine learning algorithms, financial analytics can accomplish simple tasks, like estimating
business spending and calculating costs. The jobs of algorithmic traders and fraud detectors are

both challenging. For each of these scenarios, historical data is examined to forecast future results
as accurately as possible.
In many cases, a small set of data and a simple machine learning algorithm can be sufficient for
simple tasks like estimating a business’s expenses. It’s worthwhile to note that stock traders and
dealers rely heavily on ML to accurately predict market conditions before entering the market.
Organizations can control their overall costs and maximize profits with accurate and timely
projections. When combined with automation, user analytics will result in significant cost savings.

Diagnosis of medical condition


With the help of unique diagnostic tools and successful treatment strategies, ML in medical
diagnosis has assisted several healthcare organizations in improving patient health and reducing
healthcare costs.
Hospitals, clinics, and medical organizations, in general, use it to produce near-perfect diagnoses,
predict readmissions, prescribe medications, and identify high-risk patients. The forecasts and
insights are derived from patient records, ethically-sourced data sources, and symptoms.

Strengthening cyber security


According to a recent McAfee report, cybercrime costs have surpassed $1.5 trillion worldwide since
2018. Meanwhile, hacking, phishing, or any sort of mischievous activity might result in you losing
much more than your money. It can be very detrimental to the reputation of your brand and the
privacy of your employees and customers if there is a data leak.
Analytics systems that assure data security and overall cybersecurity are powered by machine
learning. ML-based solutions save administrators from staying up at night by continuously
monitoring activities and trying to identify odd user behavior, unauthorized access, breaches,
fraud, system weaknesses, and various other issues.
This feature makes machine learning (ML) extremely valuable, especially for financial
organizations.

Increasing customer satisfaction


With the use of machine learning, customer loyalty and customer experience can be improved. In
this case, customer behavior is assessed in past call records, allowing a person or a system to
accurately assign the client’s request to the most suitable customer service representative.
Thus, the burden of customer relationship management is significantly reduced through this
assessment. Due to these reasons, corporations employ predictive algorithms to provide spot-on
product recommendations to their clients.

Cognitive services
Another important use of machine learning in businesses is secure and intuitive authentication
procedures through computer vision, image recognition, and natural language processing. What is
more, businesses can reach a lot wider audiences, as NLP allows access to multiple geographic
locations, language holders, and ethnic groups.
Another example of cognitive services is automatic or self-checkouts. Because of machine learning,
we have upgraded retail experiences, with Amazon Go being the perfect example.

6.3 Supervised Learning


Supervised learning is a type of machine learning in which an algorithm is trained on a labeled
dataset. In supervised learning, the input data is accompanied by the correct output or label, which
the algorithm uses to learn a mapping between the input and the output.
Supervised learning is used in a wide range of applications, including image classification, speech
recognition, natural language processing, and fraud detection. It is a powerful tool for building
predictive models that can automate decision-making processes and improve efficiency in a variety
of industries. Supervised learning is a type of machine learning in which the algorithm learns from
labeled data. Here are some examples of supervised learning:


Image Classification: Given a set of labeled images, the algorithm learns to classify new images
into the correct category. For example, a cat vs dog image classification task.

Sentiment Analysis: Given a set of labeled text data, the algorithm learns to classify new text
data as positive, negative, or neutral sentiment.
Spam Detection: Given a set of labeled emails, the algorithm learns to classify new emails as
spam or not spam.
Language Translation: Given a set of labeled pairs of sentences in two different languages, the
algorithm learns to translate new sentences from one language to the other.

Fraud Detection: Given a set of labeled transactions, the algorithm learns to classify new
transactions as fraudulent or legitimate.
Handwriting Recognition: Given a set of labeled handwritten letters and digits, the algorithm
learns to recognize new handwritten letters and digits.
Speech Recognition: Given a set of labeled audio samples of speech, the algorithm learns to
transcribe new speech into text.

Recommendation Systems: Given a set of labeled user-item interactions, the algorithm learns to
recommend new items to users based on their preferences.

6.4 Steps in Supervised Learning


The process of supervised learning involves the following steps:
Data Collection: The first step is to collect a dataset that includes input data and corresponding
output labels. The dataset must be large and representative of the problem being solved.
Data Preprocessing: The next step is to preprocess the data to ensure that it is clean and in the
appropriate format. This may include removing outliers, normalizing the data, or transforming it
into a format suitable for training.
Model Selection: The next step is to select an appropriate model architecture that can learn the
mapping between the input and output data. The choice of model depends on the nature of the
problem being solved and the characteristics of the dataset.

Training: The model is trained on the labeled dataset by minimizing a loss function that measures
the difference between the predicted output and the true output. The model learns to adjust its
parameters to improve its predictions.

Evaluation: The trained model is evaluated on a separate test set to measure its performance. This
is important to ensure that the model can generalize to new, unseen data.
Deployment: Once the model has been trained and evaluated, it can be deployed in the real
world to make predictions on new, unseen data.

6.5 Supervised Learning Using R


R is a popular programming language for data analysis and statistical computing that has many
packages for supervised learning. In R, you can use a variety of libraries and functions to train,
evaluate and deploy machine learning models.
Here are some popular R packages for supervised learning:
caret: A comprehensive package for training and evaluating a wide range of machine learning
models.
randomForest: A package for building random forest models, which are an ensemble learning
method for classification and regression.
glmnet: A package for fitting generalized linear models with L1 and L2 regularization.
e1071: A package for building support vector machines for classification and regression.


xgboost: A package for building gradient boosting models, which are a type of ensemble learning
method.
keras: A package for building deep learning models using the Keras API.
nnet: A package for building neural network models using the backpropagation algorithm.
ranger: A package for building random forest models that are optimized for speed and memory
efficiency.
gbm: A package for building gradient boosting models, which are a type of ensemble learning
method.
rpart: A package for building decision trees, which are a type of classification and regression tree
method.
These are just a few of the many packages available in R for supervised learning. The choice of
package will depend on the specific problem being solved and the characteristics of the data.
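
As a quick illustration of how these packages fit together, here is a minimal hedged sketch of a typical caret workflow on the built-in iris dataset (an illustrative choice, not a prescribed one):
# Load caret and split the data into training and testing sets
library(caret)
set.seed(123)
idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_set <- iris[idx, ]
test_set <- iris[-idx, ]
# Train a k-nearest-neighbours classifier with 5-fold cross-validation
fit <- train(Species ~ ., data = train_set, method = "knn",
             trControl = trainControl(method = "cv", number = 5))
# Evaluate on the held-out test set
pred <- predict(fit, newdata = test_set)
confusionMatrix(pred, test_set$Species)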

6.6 Supervised Learning using KNN


K-Nearest Neighbors (KNN) is a popular supervised learning algorithm used for classification and
regression problems. In KNN, the model predicts the target variable of a data point by finding the
K nearest data points in the training set and taking the majority vote of their class labels (in
classification) or their average value (in regression).

Example-1
In R, there are many built-in datasets that can be used to demonstrate the use of KNN for
supervised learning. One such dataset is the "iris" dataset, which contains measurements of the
sepal length, sepal width, petal length, and petal width of three different species of iris flowers.
Here's an example of how to use the "iris" dataset for KNN supervised learning in R:
# Load the "iris" dataset
data(iris)
# Split the data into training and testing sets
library(caret)
set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
trainData <- iris[trainIndex, ]
testData <- iris[-trainIndex, ]
# Preprocess the data by normalizing the features
preprocess <- preProcess(trainData[,1:4], method = c("center", "scale"))
trainData[,1:4] <- predict(preprocess, trainData[,1:4])
testData[,1:4] <- predict(preprocess, testData[,1:4])
# Train the KNN model with K = 3 (class::knn uses the Euclidean distance)
library(class)
predicted <- knn(train = trainData[,1:4], test = testData[,1:4], cl = trainData$Species, k = 3)
# Evaluate the model's accuracy
library(caret)
confusionMatrix(predicted, testData$Species)
In this example, we first load the "iris" dataset and split it into training and testing sets using the
createDataPartition() function from the caret package. We then preprocess the data by normalizing
the features using the preProcess() function from the same package.


Next, we train the KNN model using the knn() function from the class package with K = 3; the
function uses the Euclidean distance metric by default. Finally, we evaluate the performance of the model using the
confusionMatrix() function from the caret package, which calculates the accuracy, precision, recall,
and F1 score of the predictions.
Output
The output of the above example is the confusion matrix, which shows the performance of the
KNN model on the testing set. The confusion matrix contains four values:
True Positive (TP): The number of samples of a class that are correctly classified as that class (i.e., the
species of iris is correctly predicted).
False Positive (FP): The number of samples that are incorrectly classified as belonging to a class (i.e.,
the species of iris is wrongly predicted as that class).
True Negative (TN): The number of samples that are correctly classified as not belonging to a given
class.
False Negative (FN): The number of samples of a class that are incorrectly classified as belonging to
some other class (i.e., the species of iris is wrongly predicted as not being present).

The confusion matrix shows that the KNN model achieved an accuracy of 0.9333 on the testing set,
which means that 93.33% of the test samples were correctly classified by the model.
The matrix also shows that the model made one wrong prediction for the "versicolor" class and one
for the "virginica" class. In other words, the model correctly classified all 10 "setosa" samples but
made 2 incorrect predictions across the other two species.
Additionally, the matrix provides other metrics such as sensitivity, specificity, positive and
negative predictive values, and prevalence for each class. These metrics provide a more detailed
evaluation of the performance of the model on the different classes.

6.7 Supervised Learning using Decision Tree


Decision Tree is a popular algorithm for supervised learning tasks, particularly for classification
problems. It is a non-parametric algorithm that builds a tree-like model by recursively partitioning
the data into subsets based on the most significant features. It is easy to interpret and understand,
and it can handle both categorical and numerical data.


In Decision Tree, the tree structure is built based on information gain or entropy reduction, which
measures the reduction in uncertainty about the target variable that results from splitting the data
using a particular attribute. The attribute with the highest information gain is chosen as the
splitting criterion at each node.
The algorithm continues to split the data into subsets until a stopping criterion is met, such as
reaching a maximum tree depth, a minimum number of samples in a leaf node, or when all the
samples in a node belong to the same class.
Once the tree is built, it can be used to make predictions by traversing the tree from the root node
down to a leaf node that corresponds to the predicted class. The class of a new sample is
determined by following the path from the root node to the leaf node that the sample falls into.
To implement Decision Tree in R, we can use the "rpart" package, which provides the "rpart()"
function to build a Decision Tree model. The package also provides functions for visualizing the
tree structure and making predictions on new data.
Example-1
Let's demonstrate how to use the "rpart" package to build a Decision Tree model using an inbuilt
dataset in R. We will use the "iris" dataset, which contains measurements of the sepal length, sepal
width, petal length, and petal width of three different species of iris flowers.
Here is an example code that builds a Decision Tree model on the "iris" dataset:
library(rpart)
data(iris)
# Split the dataset into training and testing sets
set.seed(123)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
trainData <- iris[ind == 1, ]
testData <- iris[ind == 2, ]
# Build the Decision Tree model
model <- rpart(Species ~ ., data = trainData, method = "class")
# Visualize the Decision Tree
plot(model)
text(model)
# Make predictions on the testing set
predicted <- predict(model, testData, type = "class")
# Evaluate the model performance (confusionMatrix() comes from the caret package)
library(caret)
confusionMatrix(predicted, testData$Species)
In the code above, we first load the "rpart" package and the "iris" dataset. We then split the dataset
into a training set and a testing set, with a 70-30 split.
Next, we build the Decision Tree model using the "rpart()" function, where we specify the target
variable "Species" and the other variables as the predictors using the formula notation "Species ~ .".
We also specify the method as "class" to indicate that this is a classification problem.
After building the model, we can visualize the Decision Tree using the "plot()" and "text()"
functions.
We then use the "predict()" function to make predictions on the testing set and specify the type of
prediction as "class" to obtain the predicted class labels.
Finally, we evaluate the performance of the model using the "confusionMatrix()" function from the
"caret" package, which computes the confusion matrix and other metrics such as accuracy,
sensitivity, and specificity.


The output of the "confusionMatrix()" function provides a detailed evaluation of the performance of
the Decision Tree model on the testing set. For example, it shows the accuracy of the model, the
number of true positives, true negatives, false positives, and false negatives for each class, as well as
other performance metrics such as sensitivity, specificity, positive predictive value, and negative
predictive value.
Output
The output of the "confusionMatrix()" function shows that the model has an accuracy of 95.56%,
which means that it correctly predicted the species of 43 out of 45 instances in the testing set.
The confusion matrix shows that there is only one misclassification, where one instance of the
"setosa" species was misclassified as "versicolor". The model correctly predicted all instances of the
"virginica" species.

Overall, the result shows that the Decision Tree model performed well on the "iris" dataset,
achieving high accuracy with very few misclassifications. However, it is important to note that the
dataset is relatively small and simple, so the model's performance may not generalize well to other
datasets.

6.8 Unsupervised Learning


Unsupervised learning is a type of machine learning where the goal is to find patterns and
relationships in data without any labeled examples or specific targets to predict. Unlike supervised
learning, where the algorithm is trained on labeled data, unsupervised learning algorithms are
trained on unlabeled data.
The main objective of unsupervised learning is to discover the underlying structure of the data and
learn meaningful representations of the input data. Common unsupervised learning techniques
include clustering, dimensionality reduction, and anomaly detection.
Clustering algorithms group similar data points together into clusters, based on their similarity.
Dimensionality reduction techniques reduce the number of features in the data, while preserving
the most important information. Anomaly detection algorithms identify data points that are
significantly different from the majority of the data.
Unsupervised learning has numerous applications in various fields, such as computer vision,
natural language processing, and data analysis. It can be used to find hidden patterns in customer

data, identify groups of similar images or documents, and improve recommendations in
recommendation systems.
Unsupervised learning has a wide range of use cases in different fields. Some of the common use
cases of unsupervised learning are:
Clustering: Clustering algorithms are widely used in market segmentation, social network analysis,
image and speech recognition, and customer segmentation. For example, clustering can be used to
group similar customers together based on their purchase history and demographic information.
Anomaly detection: Anomaly detection algorithms can be used to detect fraudulent activities,
network intrusions, and outliers in data. For example, credit card companies use anomaly detection
algorithms to identify fraudulent transactions.
Dimensionality reduction: Dimensionality reduction techniques can be used to reduce the number
of features in data, while retaining the most important information. This can be used for data
visualization, feature extraction, and noise reduction.
Recommender systems: Recommender systems use unsupervised learning algorithms to analyze
user preferences and provide personalized recommendations. For example, online retailers use
collaborative filtering to recommend products to customers based on their purchase history and
browsing behavior.
Natural language processing: Unsupervised learning algorithms can be used for text analysis and
language modeling. For example, topic modeling can be used to identify themes and topics in a
large collection of documents.
Image and speech recognition: Unsupervised learning algorithms can be used to analyze and
classify images and speech. For example, clustering algorithms can be used to group similar images
together, while autoencoders can be used to learn features from images for object recognition.
Overall, unsupervised learning can be used in many applications where the data is unlabeled or the
target variable is unknown. It is a powerful tool for discovering patterns and relationships in data,
and can be used to gain insights and make better decisions.

6.9 Steps in Un-Supervised Learning


The steps involved in unsupervised learning are as follows:
Data collection: The first step in unsupervised learning is to collect the data that needs to be
analyzed. The data can come from various sources such as sensors, logs, or surveys. The data can
be in the form of structured or unstructured data.
Data pre-processing: The next step is to clean and pre-process the data. This includes removing
missing values, handling outliers, and normalizing the data. Data pre-processing is a crucial step in
unsupervised learning as it can affect the quality of the results.
Feature extraction: Unsupervised learning algorithms work with features, so the next step is to
extract relevant features from the data. Feature extraction can be done through techniques such as
principal component analysis (PCA), independent component analysis (ICA), or non-negative
matrix factorization (NMF).
Model selection: The next step is to select the appropriate unsupervised learning model based on
the problem and data. There are many unsupervised learning models such as clustering,
dimensionality reduction, and anomaly detection.
Model training: Once the model is selected, the next step is to train the model on the pre-processed
data. Unsupervised learning models do not require labeled data, so the model learns from the data
without any specific targets to predict.
Model evaluation: After the model is trained, the next step is to evaluate the performance of the
model. The evaluation metrics depend on the type of unsupervised learning problem. For example,
clustering algorithms can be evaluated using metrics such as silhouette score or Dunn index.
Model deployment: The final step is to deploy the model in a production environment. The
deployment can be done on-premise or in the cloud depending on the requirements.


Overall, unsupervised learning requires a careful analysis of the data, selection of appropriate
features and models, and evaluation of the results. It is an iterative process where the model can be
refined based on the results and feedback.

6.10 Unsupervised Learning Using R


R is a popular programming language for data analysis and has many built-in functions and
packages for unsupervised learning. Here are the steps to perform unsupervised learning using R:
Load data: The first step is to load the data into R. R supports a wide range of data formats such as
CSV, Excel, and SQL databases. You can use the read.csv function to load a CSV file.
Data pre-processing: The next step is to pre-process the data. This includes handling missing
values, scaling the data, and removing outliers. You can use functions such as na.omit to remove
missing values and scale to standardize the data.
Feature extraction: The next step is to extract relevant features from the data. R has many built-in
functions for feature extraction such as PCA (prcomp function) and NMF (nmf package).
Model selection: The next step is to select the appropriate unsupervised learning model based on
the problem and data. R has many built-in functions and packages for unsupervised learning
models such as k-means clustering (kmeans function), hierarchical clustering (hclust function), and
t-SNE (Rtsne package).
Model training: Once the model is selected, the next step is to train the model on the pre-processed
data by calling the corresponding function (for example, kmeans() or hclust()).
Model evaluation: After the model is trained, the next step is to evaluate the performance of the
model. You can use metrics such as the silhouette score or the Dunn index for clustering algorithms.
The cluster package provides functions such as silhouette() and clusplot() for this purpose.
Model deployment: The final step is to deploy the model in a production environment. R allows
you to save the model as an R object and load it for deployment.
Overall, R is a powerful tool for unsupervised learning with many built-in functions and packages.
The process involves loading the data, pre-processing the data, feature extraction, model selection,
model training, model evaluation, and model deployment.
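
As an illustration of the feature-extraction step mentioned above, here is a minimal hedged sketch that applies PCA with prcomp() to the numeric columns of the built-in iris dataset:
# PCA as a feature-extraction step
iris_features <- iris[, 1:4]
# Centre and scale the variables, then compute the principal components
pca <- prcomp(iris_features, center = TRUE, scale. = TRUE)
# Proportion of variance explained by each component
summary(pca)
# First two principal-component scores for each observation
head(pca$x[, 1:2])
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")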

6.11 Unsupervised learning using K-means


K-means is a popular clustering algorithm used in unsupervised learning. Here are the steps to
perform unsupervised learning using K-means:
Load data: The first step is to load the data into the programming language of your choice. The data
should be pre-processed and feature extracted.
Choose the number of clusters: The next step is to choose the number of clusters for K-means. The
number of clusters should be chosen based on the problem and data. There are various methods to
determine the number of clusters such as the elbow method, silhouette method, and gap statistic.
Initialize centroids: The next step is to randomly initialize the centroids for K-means. Centroids are
the points that represent the centers of the clusters. You can use the kmeans++ initialization method
to choose the initial centroids.
Assign data points to clusters: The next step is to assign each data point to the nearest centroid.
This is done by calculating the distance between each data point and each centroid.
Recalculate centroids: The next step is to recalculate the centroids for each cluster. This is done by
taking the mean of all the data points in the cluster.
Repeat steps 4 and 5: Steps 4 and 5 are repeated until the centroids converge. The convergence
criteria can be based on the number of iterations or the change in centroids.
Evaluate the results: The final step is to evaluate the results. This can be done by calculating the
within-cluster sum of squares (WCSS) or by visualizing the clusters. You can use the plot function
to visualize the clusters in two or three dimensions.


Overall, K-means is a popular clustering algorithm used in unsupervised learning. The process
involves choosing the number of clusters, initializing centroids, assigning data points to clusters,
recalculating centroids, and repeating until convergence. The results can be evaluated by
calculating the WCSS or by visualizing the clusters.
Example-1
# Load the iris dataset
data(iris)
# Select the relevant features
iris_data <- iris[,1:4]
# Scale the data
scaled_data <- scale(iris_data)
# Perform K-means clustering with 3 clusters
kmeans_result <- kmeans(scaled_data, centers = 3)
# Plot the clusters
library(cluster)
clusplot(scaled_data, kmeans_result$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)
# Calculate the within-cluster sum of squares
wss <- sum(kmeans_result$withinss)
# Print the WCSS
cat("WCSS:", wss)
In this example, the iris dataset is loaded and the relevant features are selected. The data is then
scaled using the scale function. K-means clustering is performed with 3 clusters using the kmeans
function. The clusters are then visualized using the clusplot function from the cluster package. The
within-cluster sum of squares (WCSS) is calculated using the kmeans_result$withinss variable and
printed using the cat function.
The output of this code will be a plot of the clusters and the WCSS value. The plot will show the
data points colored by their assigned cluster and the WCSS value will indicate the sum of the
squared distances between each data point and its assigned cluster center.
Output
The output of the example code using K-means clustering on the iris dataset in R includes a plot of
the clusters and the within-cluster sum of squares (WCSS) value.
The plot shows the data points colored by their assigned cluster and separated into three clusters.
The clusplot function from the cluster package is used to create the plot. The horizontal axis shows
the first principal component of the data, while the vertical axis shows the second principal
component. The data points are shaded to show the density of the points in each region. The plot
shows that the data points are well-separated into three distinct clusters, each containing data
points that are similar to each other.
The WCSS value is a measure of the goodness of fit of the K-means clustering model. It measures
the sum of the squared distances between each data point and its assigned cluster center. The lower
the WCSS value, the better the model fits the data. In this example, the WCSS value is printed to the
console using the cat function. The WCSS value for this model is 139.82.
Overall, the output of this example demonstrates how to use K-means clustering on the iris dataset
in R. The plot shows the clusters in the data, while the WCSS value indicates the goodness of fit of
the model.
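
Step 2 of the K-means procedure above mentions the elbow method for choosing the number of clusters. Here is a minimal hedged sketch that computes the total within-cluster sum of squares for k = 1 to 10, assuming the scaled_data object from the example above is still in the workspace:
# Elbow method: WCSS for k = 1..10
set.seed(123)
wcss_values <- sapply(1:10, function(k) kmeans(scaled_data, centers = k, nstart = 25)$tot.withinss)
# Plot WCSS against k; the bend ("elbow") in the curve suggests a reasonable k
plot(1:10, wcss_values, type = "b", xlab = "Number of clusters k", ylab = "Within-cluster sum of squares")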

6.12 Unsupervised Learning using Hierarchical Clustering


Hierarchical clustering is another unsupervised learning technique that is used to group similar
data points together based on their distances from each other. The basic idea of hierarchical

clustering is to build a hierarchy of clusters, starting with each data point in its own cluster and
then merging the most similar clusters together until all of the data points are in a single cluster.
Example-1
In R, we can use the hclust function to perform hierarchical clustering on a dataset. Here is an
example of using hierarchical clustering on the iris dataset in R:
# Load the iris dataset
data(iris)
# Select the relevant features
iris_data <- iris[,1:4]
# Calculate the distance matrix
dist_matrix <- dist(iris_data)
# Perform hierarchical clustering with complete linkage
hc_result <- hclust(dist_matrix, method = "complete")
# Plot the dendrogram
plot(hc_result)
In this example, we first load the iris dataset and select the relevant features. We then calculate the
distance matrix using the dist function. The hclust function is used to perform hierarchical
clustering with complete linkage. The resulting dendrogram is plotted using the plot function.
The output of this code will be a dendrogram that shows the hierarchy of clusters. The dendrogram
shows the data points at the bottom, with lines connecting the clusters that are merged together.
The height of each line indicates the distance between the merged clusters. The dendrogram can be
used to identify the number of clusters in the data based on the distance between the clusters.
Output
The output of the example code using hierarchical clustering on the iris dataset in R is a
dendrogram that shows the hierarchy of clusters.
The dendrogram is a plot that displays the hierarchy of clusters, with the data points at the bottom
and the merged clusters shown as lines that connect the points. The height of each line indicates the
distance between the merged clusters. The dendrogram can be used to determine the optimal
number of clusters in the data based on the distance between the clusters.
In this example, we used the hclust function to perform hierarchical clustering with complete
linkage on the iris dataset. The resulting dendrogram shows that there are three main clusters of
data points in the dataset, which is consistent with the known number of classes in the iris dataset.
By analyzing the dendrogram, we can see that the first split in the data occurs between the setosa
species and the other two species. The second split separates the remaining two species, versicolor
and virginica. The dendrogram can also be used to identify the distance between clusters, which
can be useful for determining the optimal number of clusters to use for further analysis.
Overall, the output of this example demonstrates how to use hierarchical clustering to analyze a
dataset in R and visualize the results using a dendrogram.
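Building on the hc_result object created above, the short sketch below cuts the dendrogram into three clusters with cutree and cross-tabulates the result against the known species labels; this is one common way to follow up the visual inspection.
# Cut the dendrogram into three clusters and compare them with the iris species
clusters <- cutree(hc_result, k = 3)
table(clusters, iris$Species)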

6.13 Classification and Prediction Accuracy in Unsupervised Learning


Classification and prediction accuracy are not typically used as metrics to evaluate the performance
of unsupervised learning algorithms. This is because unsupervised learning is not focused on
making predictions or classifying new data points based on their features, but rather on finding
underlying patterns and relationships within the data itself.
In unsupervised learning, we don't have labeled data to compare the results of the algorithm to, so
we cannot measure accuracy in the traditional sense. Instead, we use other metrics such as within-
cluster sum of squares (WCSS) and silhouette score to evaluate the quality of the clusters formed by
the algorithm.

WCSS is a measure of the sum of the squared distances between each data point and its assigned
cluster center. A lower WCSS value indicates a better clustering performance. Silhouette score is a
measure of how well each data point fits into its assigned cluster compared to other clusters. A
higher silhouette score indicates a better clustering performance.
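As a minimal sketch of how the silhouette score can be computed in practice, the code below uses the silhouette function from the cluster package on a K-means solution for the iris data; the object names km and sil are illustrative assumptions.
library(cluster)
# Cluster the scaled iris features with K-means
data(iris)
iris_scaled <- scale(iris[, 1:4])
km <- kmeans(iris_scaled, centers = 3, nstart = 25)
# Silhouette width for each point, based on the Euclidean distance matrix
sil <- silhouette(km$cluster, dist(iris_scaled))
mean(sil[, "sil_width"])   # average silhouette width; values closer to 1 are better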
While classification and prediction accuracy are not used to evaluate unsupervised learning
algorithms directly, they can be used in certain scenarios to evaluate the performance of an
unsupervised learning model indirectly. For example, if we use the clusters formed by an
unsupervised learning algorithm as input features for a subsequent classification or prediction task,
we can use classification and prediction accuracy as metrics to evaluate the performance of the
overall model.
In summary, while classification and prediction accuracy are not typically used to evaluate the
performance of unsupervised learning algorithms, they can be used in certain scenarios to evaluate
the performance of the overall model that uses the clusters formed by the unsupervised learning
algorithm as input features.

Summary
Machine learning is a field of artificial intelligence that involves developing algorithms and models
that enable computers to learn from data without being explicitly programmed. Machine learning is
used in a wide range of applications, from image and speech recognition to fraud detection and
recommendation systems. There are three main types of machine learning: supervised learning,
unsupervised learning, and reinforcement learning. In supervised learning, the machine is trained
using labeled data, while in unsupervised learning, the machine is trained using unlabeled data.
Reinforcement learning involves training a machine to learn through trial and error. Machine
learning algorithms are typically designed to improve over time as they are exposed to more data,
and they are used in a variety of industries and fields to automate decision-making and solve
complex problems. Studying machine learning provides students with a diverse set of skills and
knowledge, including programming, data handling, analytical and problem-solving skills,
collaboration, and communication skills.
Supervised learning algorithms can be further categorized as either classification or regression,
depending on the nature of the target variable. In classification problems, the target variable is
categorical, and the goal is to predict the class label or category of a new instance. In regression
problems, the target variable is continuous, and the goal is to predict a numerical value or a range
of values for a new instance.
Some common examples of supervised learning algorithms include linear regression, logistic
regression, decision trees, random forests, support vector machines (SVMs), k-nearest neighbors
(KNN), and neural networks. Each algorithm has its own strengths and weaknesses, and the choice
of algorithm depends on the nature of the problem and the characteristics of the data.
Supervised learning has many practical applications in fields such as healthcare, finance,
marketing, and engineering, among others. For example, supervised learning can be used to predict
which patients are at risk of developing a certain disease, to identify potential fraudulent
transactions in financial transactions, or to recommend products to customers based on their
browsing history.
Unsupervised learning algorithms can be used for a variety of tasks, including clustering similar
data points, reducing the dimensionality of data, and discovering hidden structures in the data.
Some common techniques used in unsupervised learning include k-means clustering, hierarchical
clustering, and principal component analysis (PCA).
The performance of unsupervised learning algorithms is typically evaluated using metrics such as
within-cluster sum of squares (WCSS) and silhouette score, which are used to evaluate the quality
of the clusters formed by the algorithm.
While unsupervised learning algorithms are not typically used for making predictions or
classifying new data points, the insights gained from analyzing the data can be used to inform
subsequent supervised learning models or other data analysis tasks. Overall, unsupervised learning
is a valuable tool for exploring and understanding complex data without prior knowledge or
guidance.

Keywords
Artificial Intelligence (AI): A field of computer science that focuses on creating intelligent
machines that can perform tasks that typically require human-like intelligence.
Big data: A large and complex data set that requires advanced tools and techniques to process and
analyze.
Data mining: The process of discovering patterns, trends, and insights in large data sets using
machine learning algorithms.
Deep learning: A subset of machine learning that uses artificial neural networks to model and
solve complex problems.
Neural network: A machine learning algorithm that is inspired by the structure and function of the
human brain.
Supervised learning: A type of machine learning where the machine is trained using labeled data,
with a clear input and output relationship.
Unsupervised learning: A type of machine learning where the machine is trained using unlabeled
data, with no clear input and output relationship.
Reinforcement learning: A type of machine learning where the machine learns by trial and error,
receiving feedback on its actions and adjusting its behavior accordingly.
Model: A mathematical representation of a real-world system or process, which is used to make
predictions or decisions based on data. In machine learning, models are typically trained on data to
improve their accuracy and performance.
Dimensionality reduction: The process of reducing the number of features used in a machine
learning model while still retaining important information. This is often done to improve
performance and reduce overfitting.
Overfitting: A problem that occurs when a machine learning model is too complex and learns to fit
the training data too closely. This can lead to poor generalization to new data.
Underfitting: A problem that occurs when a machine learning model is too simple and fails to
capture important patterns in the data. This can lead to poor performance on the training data and
new data.
Bias: A systematic error that occurs when a machine learning model consistently makes predictions
that are too high or too low.
Variance: The amount by which a machine learning model's output varies with different training
data sets. High variance can lead to overfitting.
Regularization: Techniques used to prevent overfitting in machine learning models, such as adding
a penalty term to the cost function.

Self Assessment
1. Which of the following is a type of machine learning where the machine is trained using
labeled data?
A. Supervised learning
B. Unsupervised learning
C. Reinforcement learning
D. None of the above

2. What is the process of reducing the number of features used in a machine learning model
while still retaining important information?
A. Overfitting
B. Underfitting
C. Dimensionality reduction

D. Bias

3. Which of the following is a technique used to prevent overfitting in machine learning models?
A. Ensemble learning
B. Gradient descent
C. Regularization
D. Hyperparameter tuning

4. What is the name of a machine learning algorithm that is inspired by the structure and
function of the human brain?
A. Neural network
B. Gradient descent
C. Decision tree
D. Support vector machine

5. Which type of machine learning involves training a machine to learn through trial and
error?
A. Supervised learning
B. Unsupervised learning
C. Reinforcement learning
D. None of the above

6. Which of the following is a type of machine learning where the machine is trained using
unlabeled data?
A. Supervised learning
B. Unsupervised learning
C. Reinforcement learning
D. None of the above

7. What is the process of discovering patterns, trends, and insights in large data sets using
machine learning algorithms called?
A. Feature engineering
B. Deep learning
C. Data mining
D. Supervised learning

8. Which of the following techniques is used to combine multiple machine learning models to
improve performance and reduce overfitting?
A. Gradient descent
B. Hyperparameter tuning
C. Ensemble learning
D. Regularization

9. What is the name of an optimization algorithm used to find the optimal parameters for a
machine learning model by iteratively adjusting them in the direction of steepest descent of
the cost function?
A. Gradient descent
B. Hyperparameter tuning
C. Regularization
D. Ensemble learning

10. Which of the following is a technique used to prevent underfitting in machine learning
models?
A. Ensemble learning
B. Gradient descent
C. Regularization
D. Hyperparameter tuning
11. Which of the following is not a supervised learning problem?
A. Image classification
B. Text clustering
C. Stock price prediction
D. Sentiment analysis

12. In supervised learning, what is the target variable?


A. The independent variable
B. The dependent variable
C. The test set
D. The training set

13. Which package in R provides functions for classification and regression trees?
A. caret
B. e1071
C. randomForest
D. rpart

14. What is the purpose of the "predict()" function in R?


A. To train a model on a dataset
B. To test a model on a dataset
C. To predict the target variable for new instances
D. To visualize the decision tree

15. What is the purpose of the "caret" package in R?


A. To train neural networks
B. To perform feature selection
C. To tune model hyperparameters
D. To cluster data

16. Which algorithm in R is used for K-nearest neighbors classification?


A. kmeans

B. knn
C. svm
D. lda

17. Which function in R is used to create a confusion matrix?


A. cor()
B. lm()
C. summary()
D. confusionMatrix()

18. Which of the following evaluation metrics is not used for classification problems?
A. Mean squared error (MSE)
B. Accuracy
C. Precision
D. Recall

19. In which type of supervised learning problem is the target variable categorical?
A. Regression
B. Clustering
C. Classification
D. Dimensionality reduction

20. What is the purpose of the "glmnet" package in R?


A. To perform linear regression
B. To perform logistic regression
C. To perform Lasso or Ridge regression
D. To perform K-means clustering

21. Which of the following is an example of unsupervised learning?


A. Linear regression
B. Logistic regression
C. K-means clustering
D. Decision trees

22. Which R package is commonly used for performing hierarchical clustering?


A. ggplot2
B. dplyr
C. cluster
D. caret

23. What is a dendrogram in the context of unsupervised learning?


A. A measure of the sum of squared distances between each data point and its assigned cluster
center
B. A plot that displays the hierarchy of clusters
C. A measure of how well each data point fits into its assigned cluster compared to other
clusters

D. A measure of the similarity between two data points

24. What is the purpose of the silhouette score in unsupervised learning?


A. To measure the sum of squared distances between each data point and its assigned cluster
center
B. To measure how well each data point fits into its assigned cluster compared to other clusters
C. To measure the similarity between two data points
D. To measure the variance within each cluster

25. What is the difference between supervised and unsupervised learning?


A. Supervised learning requires labeled data, while unsupervised learning does not.
B. Supervised learning is used for clustering, while unsupervised learning is used for
classification.
C. Supervised learning is used for dimensionality reduction, while unsupervised learning is
used for feature selection.
D. Supervised learning is used for finding patterns and relationships in data, while
unsupervised learning is used for making predictions.

Answers for Self Assessment


1. A 2. C 3. C 4. A 5. C

6. B 7. C 8. C 9. A 10. D

11. B 12. B 13. D 14. C 15. C

16. B 17. D 18. A 19. C 20. C

21. C 22. C 23. B 24. B 25. A

Review Questions
1) What is machine learning, and how is it different from traditional programming?
2) What are the three main types of machine learning, and what are some examples of
problems each type can solve?
3) What is the process of preparing data for use in a machine learning model, and why is it
important?
4) What are some real-world applications of supervised learning, and how are they
implemented?
5) How can machine learning be used to improve healthcare outcomes, and what are some
potential benefits and risks of using machine learning in this context?
6) How can machine learning be used to improve financial decision-making, and what are
some potential benefits and risks of using machine learning in this context?
7) How can machine learning be used to detect and prevent fraud, and what are some potential
benefits and risks of using machine learning in this context?

8) How can machine learning be used to optimize supply chain management, and what are
some potential benefits and risks of using machine learning in this context?
9) How can machine learning be used to improve customer service and customer experience,
and what are some potential benefits and risks of using machine learning in this context?
10) How can machine learning be used to enhance security and privacy, and what are some
potential benefits and risks of using machine learning in this context?
11) How can machine learning be used to advance scientific research, and what are some
potential benefits and risks of using machine learning in this context?

Further readings
learning, including tutorials, code examples, and best practices. It also includes a section
on deep learning, which is a type of machine learning that is particularly well-suited for
tasks like image recognition and natural language processing.
The Stanford Machine Learning Group: This research group at Stanford University is at
the forefront of developing new machine learning techniques and applications. Their
website includes a wide range of research papers, code libraries, and other resources for
exploring the latest developments in the field.
The Google AI Blog: Google is one of the leading companies in the field of machine
learning, and their AI blog offers insights into the latest research, tools, and applications.
They cover a wide range of topics, from natural language processing and computer vision
to ethics and fairness in machine learning.
The Microsoft Research Blog: Microsoft is another major player in the field of machine
learning, and their research blog covers a wide range of topics related to AI, including
machine learning, deep learning, and natural language processing. They also offer a
variety of tools and resources for developers who want to build machine learning
applications.
The MIT Technology Review: This publication covers a wide range of topics related to
technology and its impact on society, including machine learning. Their articles are often
well-researched and thought-provoking and can provide insights into the broader
implications of machine learning for society and the economy.

Unit 07: Text Analytics for Business


Objective
Through this chapter, students will be able to:

• Understand the key concepts and techniques of text analytics
• Develop data analysis skills
• Gain insights into customer behavior and preferences
• Enhance decision-making skills
• Improve business performance

Introduction
Text analytics for business involves using advanced computational techniques to analyze and
extract insights from large volumes of text data. This data can come from a wide range of sources,
including customer feedback, social media posts, product reviews, news articles, and more.
The goal of text analytics for business is to provide organizations with valuable insights that can be
used to make data-driven decisions and improve business performance. This includes identifying
patterns and trends in customer behavior, predicting future trends, monitoring brand reputation,
detecting fraud, and more.
Some of the key techniques used in text analytics for business include natural language processing
(NLP), which involves using computational methods to analyze and understand human language,
and machine learning algorithms, which can be trained to automatically identify patterns and
relationships in text data.
There are many different tools and platforms available for text analytics, ranging from open-source
software to commercial solutions. These tools typically include features for data cleaning and
preprocessing, feature extraction, data visualization, and more.
Overall, text analytics for business can provide organizations with a powerful tool for
understanding and leveraging the vast amounts of text data available to them. By using these
techniques to extract insights and make data-driven decisions, businesses can gain a competitive
advantage and improve their overall performance.
Text analytics for business is a powerful tool for analyzing large volumes of text data and extracting
valuable insights that can be used to make data-driven decisions. However, it is important to keep
in mind several key considerations when working with text data.
Firstly, domain expertise is crucial when analyzing text data. This means having a deep
understanding of the specific industry or context in which the text data is being analyzed. This is
especially important for industries such as healthcare or finance, where specialized knowledge is
required to properly interpret the data.
Secondly, it is important to consider the ethical implications of text analytics. This includes
ensuring that data privacy regulations are followed, and that the data is used ethically and
responsibly. It is also important to be transparent about the use of text analytics and to obtain
consent from those whose data is being analyzed.
Thirdly, integrating text data with other data sources can provide a more comprehensive
understanding of business operations and customer behavior. This can include structured data
from databases or IoT devices, as well as other sources of unstructured data such as images or
audio.
Fourthly, it is important to be aware of the limitations of text analytics. While text analytics is a
powerful tool, automated methods may struggle with complex or nuanced language, or with
accurately interpreting sarcasm or irony.

Finally, data visualization is an important component of text analytics. Effective visualization


techniques can help decision-makers understand complex patterns and relationships in text data
and make more informed decisions.
Overall, text analytics for business is a rapidly growing field that has the potential to provide
organizations with valuable insights into customer behavior, market trends, and more. By
leveraging the latest computational techniques and tools, businesses can gain a competitive
advantage and improve their overall performance. However, it is important to consider the ethical
implications of text analytics, and to use the tool responsibly and transparently.

7.1 Text Analytics


Text analytics combines a set of machine learning, statistical and linguistic techniques to process
large volumes of unstructured text or text that does not have a predefined format, to derive insights
and patterns. It enables businesses, governments, researchers, and media to exploit the enormous
content at their disposal for making crucial decisions. Text analytics uses a variety of techniques –
sentiment analysis, topic modelling, named entity recognition, term frequency, and event
extraction.

What’s the Difference Between Text Mining and Text Analytics?


Text mining and text analytics are often used interchangeably. The term text mining is generally
used to derive qualitative insights from unstructured text, while text analytics provides
quantitative results.
For example, text mining can be used to identify if customers are satisfied with a product by
analyzing their reviews and surveys. Text analytics is used for deeper insights, like identifying a
pattern or trend from the unstructured text. For example, text analytics can be used to understand a
negative spike in the customer experience or popularity of a product.
The results of text analytics can then be used with data visualization techniques for easier
understanding and prompt decision making.
What’s the Relevance of Text Analytics in Today’s World?
As of 2020, around 4.57 billion people have access to the internet, roughly 59 percent of the world’s population, and about 49 percent of them are active on social media. An enormous amount of text data is generated every day in the form of blogs, tweets, reviews, forum discussions, and surveys. In addition, most customer interactions are now digital, which creates another huge text database.
Most of the text data is unstructured and scattered around the web. If this text data is gathered,
collated, structured, and analyzed correctly, valuable knowledge can be derived from it.
Organizations can use these insights to take actions that enhance profitability, customer
satisfaction, research, and even national security.

Benefits of Text Analytics

There are many ways that text analytics can help businesses, organizations, and even social movements:
Helps businesses understand customer trends, product performance, and service quality. This results in quicker decision making, enhanced business intelligence, increased productivity, and cost savings.
Helps researchers explore a large body of pre-existing literature in a short time, extracting what is relevant to their study. This supports quicker scientific breakthroughs.
Assists in understanding general trends and opinions in society, which enables governments and political bodies in decision making.
Helps search engines and information retrieval systems improve their performance, thereby providing faster user experiences.
Refines user content recommendation systems by categorizing related content.

Text Analytics Techniques and Use Cases


There are several techniques related to analyzing the unstructured text. Each of these techniques is
used for different use case scenarios.

Sentiment analysis
Sentiment analysis is used to identify the emotions conveyed by the unstructured text. The input
text includes product reviews, customer interactions, social media posts, forum discussions, or
blogs. There are different types of sentiment analysis. Polarity analysis is used to identify if the text
expresses positive or negative sentiment. The categorization technique is used for a more fine-
grained analysis of emotions - confused, disappointed, or angry.

Use cases of sentiment analysis:


Customer feedback analysis: Sentiment analysis can be used to analyze customer feedback, such as
product reviews or survey responses, to understand customer sentiment and identify areas for
improvement. For example, a company could use sentiment analysis to identify common themes in
negative reviews and use that information to improve product features or customer service.

Brand reputation monitoring: Sentiment analysis can be used to monitor brand reputation by
analyzing mentions of a company or brand on social media, news articles, or other online sources.
This can help companies to identify negative sentiment or potential issues and respond quickly to
protect their brand reputation.

Market research: Sentiment analysis can be used in market research to understand consumer
sentiment towards a particular product, service, or brand. This can help companies to identify
opportunities for innovation or new product development.
Financial analysis: Sentiment analysis can be used in financial analysis to analyze the sentiment
expressed in news articles, social media posts, and other sources of financial news. This can help
investors to make more informed decisions by identifying potential risks and opportunities.

Political analysis: Sentiment analysis can be used in political analysis to analyze public opinion
and sentiment towards political candidates or issues. This can help political campaigns to identify
key issues and target their messaging more effectively.

Topic modelling
This technique is used to find the major themes or topics in a massive volume of text or a set of documents. Topic modeling uses the keywords that appear in a text to determine its subject.

Use cases of topic modeling:


Content categorization and organization: Topic modeling can be used to automatically categorize
and organize large volumes of text data, such as news articles or research papers. This can help
researchers, journalists, or content creators to quickly identify relevant articles and topics.

Customer feedback analysis: Topic modeling can be used to analyze customer feedback, such as
product reviews or survey responses, to identify common themes or topics. This can help
companies to identify areas for improvement and prioritize customer needs.

Trend analysis: Topic modeling can be used to identify trends and patterns in large volumes of text
data, such as social media posts or news articles. This can help companies to stay up-to-date on the
latest trends and identify emerging topics or issues.

Competitive analysis: Topic modeling can be used to analyze competitor websites, social media
pages, and other online sources to identify key topics or themes. This can help companies to stay
competitive by understanding the strengths and weaknesses of their competitors.

Content recommendation: Topic modeling can be used to recommend relevant content to


customers based on their interests and preferences. This can help companies to provide a more
personalized experience for their customers and increase engagement.

Named Entity Recognition (NER)


NER is a text analytics technique used for identifying named entities like people, places,
organizations, and events in unstructured text. NER extracts nouns from the text and determines
the values of these nouns.

Use cases of named entity recognition:


Customer relationship management: NER can be used to identify the names and contact
information of customers mentioned in customer feedback, such as product reviews or survey
responses. This can help companies to personalize their communication with customers and
improve customer satisfaction.

Fraud detection: NER can be used to identify names, addresses, and other personal information
associated with fraudulent activities, such as credit card fraud or identity theft. This can help
financial institutions and law enforcement agencies to prevent fraud and protect their customers.

Media monitoring: NER can be used to monitor mentions of specific companies, individuals, or
topics in news articles or social media posts. This can help companies to stay up-to-date on the
latest trends and monitor their brand reputation.

Market research: NER can be used to identify the names and affiliations of experts or key
influencers in a particular industry or field. This can help companies to conduct more targeted
research and identify potential collaborators or partners.

Document categorization: NER can be used to automatically categorize documents based on the
named entities mentioned in the text. This can help companies to quickly identify relevant
documents and extract useful information.

Event extraction
This is a text analytics technique that is an advancement over named entity extraction. Event
extraction recognizes events mentioned in text content, for example, mergers, acquisitions, political
moves, or important meetings. Event extraction requires an advanced understanding of the
semantics of text content. Advanced algorithms strive to recognize not only events but the venue,
participants, date, and time wherever applicable. Event extraction is a beneficial technique that has
multiple uses across fields.

Use cases of event extraction:


Link analysis: This is a technique to understand “who met whom and when” through event
extraction from communication over social media. This is used by law enforcement agencies to
predict possible threats to national security.
Geospatial analysis: When events are extracted along with their locations, the insights can be used
to overlay them on a map. This is helpful in the geospatial analysis of the events.
Business risk monitoring: Large organizations deal with multiple partner companies and
suppliers. Event extraction techniques allow businesses to monitor the web to find out if any of
their partners, like suppliers or vendors, are dealing with adverse events like lawsuits or
bankruptcy.
Social media monitoring: Event extraction can be used to monitor social media posts and identify
events or activities related to a particular topic or brand. This can help companies to stay up-to-date
on the latest trends and monitor their brand reputation.
Fraud detection: Event extraction can be used to identify suspicious activities or events associated
with fraudulent behavior, such as credit card fraud or money laundering. This can help financial
institutions and law enforcement agencies to prevent fraud and protect their customers.
Supply chain management: Event extraction can be used to track and monitor events related to the
supply chain, such as shipment delays or inventory shortages. This can help companies to optimize
their supply chain operations and improve customer satisfaction.
Risk management: Event extraction can be used to identify potential risks or threats, such as
natural disasters or cyber attacks. This can help companies to mitigate the impact of these events
and protect their assets.
News analysis: Event extraction can be used to analyze news articles and identify key events or
activities related to a particular industry or topic. This can help companies to stay informed and
make more informed decisions.

7.2 Creating and Refining Text Data


Creating and refining text data using R programming involves several steps, including:
Data collection: The first step is to collect the text data you want to analyze. This could be from a
variety of sources, such as social media, customer feedback, or news articles.
Data cleaning: Once you have collected the data, the next step is to clean and preprocess it. This
may involve removing stop words, punctuation, and special characters, as well as converting text to
lowercase and removing any unwanted or irrelevant data.
Tokenization: Tokenization is the process of breaking up the text data into individual words or
tokens. This is an important step for many text analytics techniques, as it enables you to analyze the
text data at a more granular level.

Stemming and lemmatization: Stemming and lemmatization are techniques used to reduce words
to their base form or root form. This can help to reduce the dimensionality of the data and improve
the accuracy of text analytics models.
Sentiment analysis: Sentiment analysis is a common text analytics technique used to identify the
sentiment or emotion expressed in the text data. R programming offers several packages and
functions for sentiment analysis, including the popular "tidytext" and "sentimentr" packages.
Topic modeling: Topic modeling is another common text analytics technique used to identify the
underlying topics or themes in the text data. R programming offers several packages and functions
for topic modeling, including the "tm" and "topicmodels" packages.
Named entity recognition: Named entity recognition is a technique used to identify and classify
named entities, such as people, organizations, and locations, within the text data. R programming
offers several packages and functions for named entity recognition, including the "openNLP" and
"NLP" packages.
Overall, R programming provides a wide range of tools and techniques for creating and refining
text data for text analytics. By using R programming to preprocess, analyze, and visualize text data,
businesses can gain valuable insights into customer behavior, market trends, and potential risks or
opportunities.

7.3 Developing Word Cloud using R


To develop a word cloud using R, we need to install and load the 'wordcloud' and 'tm' packages.
The 'tm' package provides text mining functionalities, and the 'wordcloud' package provides
functions for creating word clouds.
Example-1
Here's an example code for creating a word cloud from a text file:
# Install and load required packages
install.packages("tm")
install.packages("wordcloud")
library(tm)
library(wordcloud)
# Load text data from a file
text <- readLines("text_file.txt")
# Create a corpus
corpus <- Corpus(VectorSource(text))
# Clean the corpus
corpus <- tm_map(corpus, content_transformer(tolower)) # convert to lowercase
corpus <- tm_map(corpus, removeNumbers) # remove numbers
corpus <- tm_map(corpus, removePunctuation) # remove punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english")) # remove stopwords
# Create a term document matrix
tdm <- TermDocumentMatrix(corpus)
# Convert the term document matrix to a frequency matrix
freq<- as.matrix(tdm)
freq<- sort(rowSums(freq), decreasing = TRUE)
# Create a word cloud
wordcloud(words = names(freq), freq = freq, min.freq = 2,
          max.words = 100, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
OUTPUT
In this example, we first load the text data from a file and create a corpus. We then clean the corpus
by converting all the text to lowercase, removing numbers, removing punctuation, and removing
stopwords.
Next, we create a term document matrix and convert it to a frequency matrix. Finally, we create a
word cloud using the 'wordcloud' function, where the 'words' parameter takes the names of the
words, and the 'freq' parameter takes the frequency of the words. We can also set parameters such
as the minimum and maximum frequency of words to include in the cloud, the maximum number
of words to show, the rotation of the words, and the color palette.

7.4 Sentiment Analysis Using R


The sentimentr package for R is useful for analyzing text in psychological or sociological studies. Its first big advantage is that it makes sentiment analysis simple and achievable within a few lines of code. Its second big advantage is that it corrects for inversions: while a more basic sentiment analysis would judge “I am not good” as positive because of the adjective “good”, sentimentr recognizes the inversion and classifies the sentence as negative.
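The inversion handling described above can be checked directly. The snippet below is a small sketch using the sentiment function from the sentimentr package; the exact scores depend on the package version.
library(sentimentr)
sentiment("I am good.")      # returns a positive sentiment score
sentiment("I am not good.")  # the negator "not" flips the score to negative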
Example-1
Here's a practical example of how R programming can be used for sentiment analysis on customer
reviews:
Suppose you work for a hotel chain and you want to analyze customer reviews to understand their
satisfaction levels. You have collected a dataset of customer reviews from various online sources,
including Booking.com, TripAdvisor, and Expedia.
Data cleaning: First, you need to clean and preprocess the data to remove any unwanted
characters, punctuation, and stop words. You can use the "tm" package in R to perform this step:
library(tm)
# Read in the raw text data
raw_data<- readLines("hotel_reviews.txt")
# Create a corpus object

corpus <- Corpus(VectorSource(raw_data))


# Convert text to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove stop words
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)
# Remove whitespace
corpus <- tm_map(corpus, stripWhitespace)
# Remove numbers
corpus <- tm_map(corpus, removeNumbers)
# Remove any additional custom words or patterns
corpus <- tm_map(corpus, removeWords, c("hotel", "room", "stay", "staff"))
# Convert back to plain text
clean_data <- sapply(corpus, as.character)
Sentiment analysis: Next, you can use the "tidytext" package in R to perform sentiment analysis on
the cleaned data. This package provides a pre-trained sentiment lexicon, which you can use to
assign a positive or negative sentiment score to each word in the text data:
library(dplyr)
library(tidytext)
# Load the sentiment lexicon
sentiments <- get_sentiments("afinn")
# Convert the cleaned data to a tidy format, keeping a review id for later grouping
tidy_data <- tibble(doc_id = seq_along(clean_data), text = clean_data) %>%
  unnest_tokens(word, text)
# Join the sentiment lexicon to the tidy data
sentiment_data <- tidy_data %>% inner_join(sentiments, by = "word")
# Aggregate the sentiment scores at the review level
review_sentiments <- sentiment_data %>%
  group_by(doc_id) %>%
  summarize(sentiment_score = sum(value))
Visualization: Finally, you can use the "ggplot2" package in R to create a visualization of the
sentiment analysis results, such as a histogram or a word cloud:
library(ggplot2)
# Create a histogram of sentiment scores
ggplot(review_sentiments, aes(x = sentiment_score)) +
geom_histogram(binwidth = 1, fill = "lightblue", color = "black") +
labs(title = "Sentiment Analysis Results", x = "Sentiment Score", y = "Number of Reviews")
# Create a word cloud of the most frequent positive and negative words
library(wordcloud)   # provides wordcloud() and loads RColorBrewer for brewer.pal()
positive_words<- sentiment_data %>%
filter(value > 0) %>%
count(word, sort = TRUE) %>%

head(20)
negative_words<- sentiment_data %>%
filter(value < 0) %>%
count(word, sort = TRUE) %>%
head(20)
wordcloud(positive_words$word, positive_words$n, scale=c(4,0.5), min.freq = 1, colors =
brewer.pal(8, "Dark2"))
wordcloud(negative_words$word, negative_words$n, scale=c(4,0.5), min.freq = 1, colors =
brewer.pal(8, "Dark2"))
By using R programming for sentiment analysis on customer reviews, you can gain insights into the overall sentiment of customers towards your hotel chain and identify common themes in positive and negative feedback that point to areas for improvement.
OUTPUT
Word Cloud

Most of the words are indeed related to the hotels: room, staff, breakfast, etc. Some words are more
related to the customer experience with the hotel stay: perfect, loved, expensive, dislike, etc.
Sentiment Score

The above graph shows the distribution of review sentiments among good and bad reviews. Most good reviews receive clearly positive sentiment scores, while bad reviews tend to have lower, often negative, scores.
Example-2
Here's an example of sentiment analysis on a real-world dataset of tweets related to the COVID-19
pandemic using R programming:
Data collection: First, we need to collect a dataset of tweets related to COVID-19. We can use the
Twitter API to collect the data, or we can use a pre-existing dataset such as the one available on
Kaggle.
# Load the dataset
library(dplyr)
tweets <- read.csv("covid19_tweets.csv")
# Filter for English-language tweets
tweets <- tweets %>% filter(lang == "en")
Data cleaning: Next, we need to clean and preprocess the text data to remove any unwanted
characters, punctuation, and stop words. We can use the "tm" package in R to perform this step:
library(tm)
# Create a corpus object
corpus <- Corpus(VectorSource(tweets$text))
# Convert text to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
# tm has no built-in removers for URLs, usernames, or hashtags, so define them here
removeURL <- content_transformer(function(x) gsub("http\\S+|www\\.\\S+", "", x))
removeTwitterUser <- content_transformer(function(x) gsub("@\\w+", "", x))
removeHashTags <- content_transformer(function(x) gsub("#\\w+", "", x))
# Remove URLs
corpus <- tm_map(corpus, removeURL)
# Remove usernames
corpus <- tm_map(corpus, removeTwitterUser)
# Remove hashtags
corpus <- tm_map(corpus, removeHashTags)
# Remove stop words
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)
# Remove whitespace
corpus <- tm_map(corpus, stripWhitespace)
# Remove numbers
corpus <- tm_map(corpus, removeNumbers)
# Convert back to plain text
clean_data <- sapply(corpus, as.character)
Sentiment analysis: We can use the "tidytext" package in R to perform sentiment analysis on the
cleaned data. This package provides a pre-trained sentiment lexicon, which we can use to assign a
positive or negative sentiment score to each word in the text data:
library(tidytext)
# Load the sentiment lexicon
sentiments <- get_sentiments("afinn")
# Convert the cleaned data to a tidy format, keeping a tweet id for later grouping
tidy_data <- tibble(doc_id = seq_along(clean_data), text = clean_data) %>%
  unnest_tokens(word, text)

# Join the sentiment lexicon to the tidy data
sentiment_data <- tidy_data %>%
  inner_join(sentiments, by = "word")
# Aggregate the sentiment scores at the tweet level
tweet_sentiments<- sentiment_data %>%
group_by(doc_id) %>%
summarize(sentiment_score = sum(value))
Visualization: We can use the "ggplot2" package in R to create a visualization of the sentiment
analysis results, such as a histogram or a time series plot:
library(ggplot2)
# Create a histogram of sentiment scores
ggplot(tweet_sentiments, aes(x = sentiment_score)) +
geom_histogram(binwidth = 1, fill = "lightblue", color = "black") +
labs(title = "Sentiment Analysis Results", x = "Sentiment Score", y = "Number of Tweets")
# Create a time series plot of sentiment scores over time
tweets$doc_id <- seq_len(nrow(tweets))   # tweet id matching the one created above
tweets$date <- as.Date(tweets$date, "%Y-%m-%d")
sentiment_ts <- tweet_sentiments %>%
  left_join(tweets %>% select(doc_id, date), by = "doc_id") %>%
  group_by(date) %>%
  summarize(sentiment_score = mean(sentiment_score))
ggplot(sentiment_ts, aes(x = date, y = sentiment_score)) +
geom_line(color = "lightblue") +
labs(title = "Sentiment Analysis Results", x = "Date", y = "Sentiment Score")
OUTPUT
The output of the sentiment analysis in the example above is a dataset containing the sentiment
score for each tweet, where a positive score indicates a positive sentiment and a negative score
indicates a negative sentiment. The sentiment score is calculated by summing the sentiment scores
of each word in the tweet, as assigned by the AFINN sentiment lexicon.

The first visualization in the example is a histogram of sentiment scores, which shows the
distribution of sentiment in the dataset. The x-axis represents the sentiment score, and the y-axis
represents the number of tweets with that score. The histogram is colored in light blue and has
black borders.
The histogram shows that the sentiment scores in the dataset are mostly centered around zero,
indicating a neutral sentiment. However, there are some tweets with a positive sentiment score and
some tweets with a negative sentiment score, suggesting that there is some variation in the
sentiment of the tweets related to COVID-19.

The second visualization in the example is a time series plot of sentiment scores over time. The x-
axis represents the date of the tweet, and the y-axis represents the average sentiment score for
tweets posted on that day. The plot is colored in light blue and has a solid line connecting the
points.
The time series plot shows that the sentiment of the tweets related to COVID-19 has fluctuated over
time. There are some periods where the sentiment is more positive, such as in March 2020 when the
pandemic was first declared, and other periods where the sentiment is more negative, such as in
January 2021 when new variants of the virus were identified. The plot can help identify trends in

the sentiment of the tweets related to COVID-19 over time, which can be useful for understanding
public opinion and sentiment around the pandemic.

7.5 Topic Modelling and TDM Analysis


In text mining, we often have collections of documents, such as blog posts or news articles, that
we’d like to divide into natural groups so that we can understand them separately. Topic modeling
is a method for unsupervised classification of such documents, similar to clustering on numeric
data, which finds natural groups of items even when we’re not sure what we’re looking for.
Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model. It treats
each document as a mixture of topics, and each topic as a mixture of words. This allows documents
to “overlap” each other in terms of content, rather than being separated into discrete groups, in a
way that mirrors typical use of natural language.

Latent Dirichlet allocation


Latent Dirichlet allocation is one of the most common algorithms for topic modeling. Without
diving into the math behind the model, we can understand it as being guided by two principles.
Every document is a mixture of topics. We imagine that each document may contain words from
several topics in particular proportions. For example, in a two-topic model we could say
“Document 1 is 90% topic A and 10% topic B, while Document 2 is 30% topic A and 70% topic B.”
Every topic is a mixture of words. For example, we could imagine a two-topic model of American
news, with one topic for “politics” and one for “entertainment.” The most common words in the
politics topic might be “President”, “Congress”, and “government”, while the entertainment topic
may be made up of words such as “movies”, “television”, and “actor”. Importantly, words can be
shared between topics; a word like “budget” might appear in both equally.
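Both principles can be seen in a fitted model. The minimal sketch below uses the AssociatedPress document-term matrix that ships with the topicmodels package, together with the tidy helpers from tidytext, to extract the per-topic word probabilities (beta) and the per-document topic proportions (gamma); the object names are illustrative.
library(topicmodels)
library(tidytext)
# Fit a small LDA model on a subset of the bundled AssociatedPress matrix
data("AssociatedPress", package = "topicmodels")
ap_lda <- LDA(AssociatedPress[1:100, ], k = 2, control = list(seed = 1234))
tidy(ap_lda, matrix = "beta")    # each topic as a mixture of words
tidy(ap_lda, matrix = "gamma")   # each document as a mixture of topics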

7.6 Topic Modelling Using R


R provides several packages that can be used for topic modeling, including the "tm" package,
"topicmodels" package, and "lda" package.
Example-1
Load the necessary packages
library(tm)
library(topicmodels)

Loading data and text preprocessing


# load data
textdata<- base::readRDS(url("https://slcladal.github.io/data/sotu_paragraphs.rda", "rb"))
# load stopwords
english_stopwords<- readLines("https://slcladal.github.io/resources/stopwords_en.txt", encoding
= "UTF-8")
# create corpus object
corpus <- Corpus(DataframeSource(textdata))
# Preprocessing chain
processedCorpus<- tm_map(corpus, content_transformer(tolower))
processedCorpus<- tm_map(processedCorpus, removeWords, english_stopwords)
processedCorpus<- tm_map(processedCorpus, removePunctuation, preserve_intra_word_dashes =
TRUE)
processedCorpus<- tm_map(processedCorpus, removeNumbers)
processedCorpus<- tm_map(processedCorpus, stemDocument, language = "en")
processedCorpus<- tm_map(processedCorpus, stripWhitespace)
Convert the text data into a document-term matrix
dtm <- DocumentTermMatrix(processedCorpus)
Perform topic modeling using the LDA algorithm
lda_model<- LDA(dtm, k = 5, method = "Gibbs", control = list(seed = 1234))
In the above code, we specified k=5 to indicate that we want to extract 5 topics from the text data.
We also set the seed value to ensure reproducibility.
Print the top words in each topic
terms <- terms(lda_model, 10)
for (i in 1:5) {
cat(paste("Topic", i, ": ", sep = ""))
print(terms[, i])
}
This will print the top 10 words in each topic.
There are many other options and variations available for topic modeling in R, but this should
provide a basic introduction to the process.
OUTPUT
The console output lists the ten most probable terms for each of the five topics.

Summary
Text analytics, also known as text mining, is the process of analyzing unstructured text data to
extract meaningful insights and patterns. It involves applying statistical and computational
techniques to text data to identify patterns and relationships between words and phrases, and to
uncover insights that can help organizations make data-driven decisions.
Text analytics can be used for a wide range of applications, such as sentiment analysis, topic
modeling, named entity recognition, and event extraction. Sentiment analysis involves identifying
the sentiment of text data, whether it is positive, negative, or neutral. Topic modeling involves
identifying topics or themes within a text dataset, while named entity recognition involves
identifying and classifying named entities, such as people, organizations, and locations. Event
extraction involves identifying and extracting events and their related attributes from text data.
Text analytics can provide valuable insights for businesses, such as identifying customer
preferences and opinions, understanding market trends, and detecting emerging issues and
concerns. It can also help organizations monitor their brand reputation, improve customer service,
and optimize their marketing strategies.
Text analytics can be performed using various programming languages and tools, such as R,
Python, and machine learning libraries. It requires a combination of domain knowledge, statistical
and computational expertise, and creativity in identifying relevant patterns and relationships
within text data.
In summary, text analytics is a powerful tool for analyzing and extracting insights from
unstructured text data. It has a wide range of applications in business and can help organizations
make data-driven decisions, improve customer service, and optimize their marketing strategies.

Keywords
Text Analytics: The process of analyzing unstructured text data to extract meaningful insights and
patterns.
Sentiment Analysis: The process of identifying and extracting the sentiment of text data, whether it
is positive, negative, or neutral.
Topic Modeling: The process of identifying topics or themes within a text dataset.
Named Entity Recognition: The process of identifying and classifying named entities, such as
people, organizations, and locations, in a text dataset.
Event Extraction: The process of identifying and extracting events and their related attributes from
text data.

Natural Language Processing (NLP): The use of computational techniques to analyze and
understand natural language data.
Machine Learning: The use of algorithms and statistical models to learn patterns and insights from
data.
Corpus: A collection of text documents used for analysis.
Term Document Matrix: A matrix representation of the frequency of terms in a corpus.
Word Cloud: A visual representation of the most frequently occurring words in a corpus, with
larger font sizes indicating higher frequency.

Self Assessment
1. What is text analytics?
A. The process of analyzing structured data
B. The process of analyzing unstructured text data
C. The process of analyzing both structured and unstructured data
D. The process of creating structured data from unstructured text data

2. What is sentiment analysis?


A. The process of identifying topics or themes within a text dataset
B. The process of identifying and classifying named entities in a text dataset
C. The process of identifying and extracting events and their related attributes from text data
D. The process of identifying and extracting the sentiment of text data

3. What is topic modeling?


A. The process of identifying and classifying named entities in a text dataset
B. The process of identifying and extracting events and their related attributes from text data
C. The process of identifying topics or themes within a text dataset
D. The process of identifying the sentiment of text data

4. What is named entity recognition?


A. The process of identifying topics or themes within a text dataset
B. The process of identifying and extracting events and their related attributes from text data
C. The process of identifying and classifying named entities in a text dataset
D. The process of identifying the sentiment of text data

5. What is event extraction?


A. The process of identifying topics or themes within a text dataset
B. The process of identifying and classifying named entities in a text dataset
C. The process of identifying and extracting events and their related attributes from text data
D. The process of identifying the sentiment of text data

6. What is the purpose of natural language processing (NLP)?


A. To analyze and understand natural language data
B. To create structured data from unstructured text data
C. To analyze and understand structured data
D. To transform data from one format to another

7. What is machine learning?


A. The use of computational techniques to analyze and understand natural language data
B. The use of algorithms and statistical models to learn patterns and insights from data
C. The process of identifying and extracting events and their related attributes from text data
D. The process of identifying and classifying named entities in a text dataset

8. What is a corpus in the context of text analytics?


A. A collection of structured data
B. A collection of unstructured text data
C. A visual representation of the most frequently occurring words in a dataset
D. A matrix representation of the frequency of terms in a dataset

9. Which R package is commonly used for text analytics?


A. ggplot2
B. dplyr
C. tidyr
D. tm

10. What does the function Corpus() do in R?


A. It creates a word cloud.
B. It creates a term-document matrix.
C. It creates a corpus of text documents.
D. It performs sentiment analysis.

11. Which function in R is used to preprocess text data by removing stop words and
stemming?
A. tm_map()
B. corpus()
C. termFreq()
D. wordcloud()

12. What is a term-document matrix in R?


A. A matrix that represents the frequency of terms in a corpus.
B. A matrix that represents the frequency of documents in a corpus.
C. A matrix that represents the frequency of sentences in a corpus.
D. A matrix that represents the frequency of paragraphs in a corpus.

13. What does the function wordcloud() do in R?


A. It creates a term-document matrix.
B. It performs sentiment analysis.
C. It creates a word cloud.
D. It preprocesses text data.

14. Which R package is used for sentiment analysis?


A. ggplot2
B. dplyr

C. tidyr
D. SentimentAnalysis

15. What is the purpose of the function removeWords() in R?


A. To remove stop words from text data.
B. To remove numbers from text data.
C. To remove punctuation from text data.
D. To remove specific words from text data.

16. What is the purpose of the function findAssocs() in R?
A. To find associations between words in a corpus.
B. To find associations between documents in a corpus.
C. To find associations between sentences in a corpus.
D. To find associations between paragraphs in a corpus.

17. What does the function findFreqTerms() do in R?


A. It finds the most frequent terms in a corpus.
B. It finds the most frequent documents in a corpus.
C. It finds the most frequent sentences in a corpus.
D. It finds the most frequent paragraphs in a corpus.

18. What is the purpose of the function LDA() in R?


A. To perform topic modeling.
B. To perform sentiment analysis.
C. To perform named entity recognition.
D. To perform event extraction.

19. Which R package is commonly used for topic modeling?


A. plyr
B. ggplot2
C. topicmodels
D. dplyr

20. What is a document-term matrix in topic modeling?


A. A matrix that represents the distribution of topics in each document
B. A matrix that represents the frequency of each term in each document
C. A matrix that represents the distribution of terms in each topic
D. A matrix that represents the probability of each topic appearing in each term

21. What is LDA in topic modeling?


A. A data analysis technique used to identify linear relationships between variables
B. A machine learning algorithm used for clustering data points
C. A statistical model used for predicting categorical outcomes
D. A probabilistic model used for discovering latent topics in a corpus of text data

22. What is the purpose of the control argument in the LDA function in R?

A. To specify the number of topics to extract from the text data


B. To set the seed value for reproducibility
C. To specify the algorithm to use for topic modeling
D. To set the convergence criteria for the model estimation

23. Which function in R can be used to print the top words in each topic after performing
topic modeling?
A. topics()
B. top.words()
C. lda_terms()
D. terms()

Answers for Self Assessment


1. B 2. D 3. C 4. C 5. C

6. A 7. B 8. B 9. D 10. C

11. A 12. A 13. C 14. D 15. D

16. A 17. A 18. A 19. C 20. B

21. D 22. B 23. D
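
For quick revision, the functions referred to in the questions above (Corpus(), tm_map(), DocumentTermMatrix(), findFreqTerms(), wordcloud(), LDA(), and terms()) typically fit together as in the minimal sketch below. The sample documents and parameter values are hypothetical, and the sketch assumes the tm, SnowballC, wordcloud, and topicmodels packages are installed.

# A minimal sketch, assuming the tm, SnowballC, wordcloud, and topicmodels packages
library(tm)
library(wordcloud)
library(topicmodels)

docs <- c("Sales grew strongly in the retail segment",
          "Customers praised the new retail loyalty program",
          "Cloud revenue and data services drove growth",
          "The data platform and cloud services expanded rapidly")   # hypothetical documents

corpus <- Corpus(VectorSource(docs))                   # create a corpus of text documents
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)                 # stemming (uses SnowballC)

dtm <- DocumentTermMatrix(corpus)                      # term frequencies per document
findFreqTerms(dtm, lowfreq = 2)                        # most frequent terms in the corpus
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
wordcloud(names(freq), freq, min.freq = 1)             # word cloud of the corpus

lda_model <- LDA(dtm, k = 2, control = list(seed = 1234))   # topic modeling with LDA
terms(lda_model, 3)                                    # top words in each topic
topics(lda_model)                                      # most likely topic per document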

Review Questions
1) What are the common steps involved in topic modeling using R?
2) How can you preprocess text data for topic modeling in R?
3) What is a document-term matrix, and how is it used in topic modeling?
4) What is LDA, and how is it used for topic modeling in R?
5) How do you interpret the output of topic modeling in R, including the document-topic
matrix and top words in each topic?
6) What are some common techniques for evaluating the quality of topic modeling results in
R?
7) Can you describe some potential applications of topic modeling in various fields, such as
marketing, social sciences, or healthcare?
8) How can you visualize the results of topic modeling in R?
9) What are some best practices to follow when performing topic modeling in R, such as
choosing the optimal number of topics and tuning model parameters?
10) What are some common challenges in text analytics and how can they be addressed?

Further Readings
"Text Mining with R: A Tidy Approach" by Julia Silge and David Robinson - This book
provides a comprehensive introduction to text mining with R programming, covering
topics such as sentiment analysis, topic modeling, and natural language processing.


Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O'Reilly Media, Inc.
Sarkar, D. (2019). Text analytics with Python: A practical real-world approach to gaining
actionable insights from your data. Apress.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval.
Cambridge University Press.
Berry, M. W., & Castellanos, M. (2008). Survey of text mining: Clustering, classification,
and retrieval. Springer.

Dr. Mohd Imran Khan, Lovely Professional University

Unit 08: Business Intelligence


CONTENTS
Introduction
8.1 BI – Importance
8.2 BI – Advantages
8.3 Business Intelligence - Disadvantages
8.4 Environmental Factors Affecting Business Intelligence
8.5 Common Mistakes in Implementing Business Intelligence
8.6 Business Intelligence – Applications
8.7 Recent Trends in Business Intelligence
8.8 Similar BI systems
8.9 Business Intelligence Applications
Summary
Keywords
Self Assessment
Answers for Self Assessment
Review Questions
Further Readings

Introduction
Decisions drive organizations. Making a good decision at a critical moment may lead to a more
efficient operation, a more profitable enterprise, or perhaps a more satisfied customer. So, it only
makes sense that the companies that make better decisions are more successful in the long run.
That’s where business intelligence comes in. Business intelligence is defined in various ways (our
chosen definition is in the next section). For the moment, though, think of BI as using data about
yesterday and today to make better decisions about tomorrow. Whether it’s selecting the right
criteria to judge success, locating and transforming the appropriate data to draw conclusions, or
arranging information in a manner that best shines a light on the way forward, business
intelligence makes companies smarter. It allows managers to see things more clearly, and permits
them a glimpse of how things will likely be in the future.
Business intelligence is a flexible resource that can work at various organizational levels and at various times. For example:

 A sales manager is deliberating over which prospects the account executives should focus on
in the final-quarter profitability push
 An automotive firm’s research-and-development team is deciding which features to include in
next year’s sedan
 The fraud department is deciding on changes to customer loyalty programs that will root out
fraud without sacrificing customer satisfaction
BI (Business Intelligence) is a set of processes, architectures, and technologies that convert raw data
into meaningful information that drives profitable business actions. It is a suite of software and
services to transform data into actionable intelligence and knowledge.
Business Intelligence has evolved considerably over time.


BI has a direct impact on an organization's strategic, tactical, and operational business decisions. BI supports fact-based decision making using historical data rather than assumptions and gut feeling. BI tools perform data analysis and create reports, summaries, dashboards, maps, graphs, and charts to provide users with detailed intelligence about the nature of the business.
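
As a simple illustration (the data frame, region names, and figures below are hypothetical), a few lines of R can already turn raw transaction records into the kind of summary report and chart a BI tool would present:

# Minimal sketch using base R; data and column names are illustrative assumptions
sales <- data.frame(
  region  = c("North", "South", "North", "East", "South", "East"),
  revenue = c(120, 95, 150, 80, 110, 130)
)

report <- aggregate(revenue ~ region, data = sales, FUN = sum)   # summary "report"
print(report)

barplot(report$revenue, names.arg = report$region,               # simple dashboard-style chart
        main = "Revenue by Region", ylab = "Revenue")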

8.1 BI – Importance

Business intelligence plays a very important role in businesses:

 Measurement: creating KPIs (Key Performance Indicators) based on historical data (a minimal R sketch follows this list)
 Identify and set benchmarks for various processes.
 With BI systems, organizations can identify market trends and spot business problems that need to be addressed.
 BI supports data visualization, which enhances data quality and thereby the quality of decision making.
 BI systems can be used not just by large enterprises but also by SMEs (Small and Medium Enterprises).
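
To make the first point above concrete, the short R sketch below computes one such KPI, month-over-month revenue growth, from hypothetical historical figures and checks it against an assumed 5% benchmark:

# Minimal sketch; monthly figures, column names, and the 5% benchmark are hypothetical
history <- data.frame(
  month   = c("Jan", "Feb", "Mar", "Apr"),
  revenue = c(100, 110, 104, 125)
)

# KPI: month-over-month revenue growth (%)
history$growth_pct <- c(NA, round(diff(history$revenue) / head(history$revenue, -1) * 100, 1))

# Benchmark: flag months that fall below 5% growth
history$below_benchmark <- history$growth_pct < 5
print(history)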

8.2 BI – Advantages
Business Intelligence has following advantages:

1. Boost productivity
With a BI program, businesses can create reports with a single click, saving a great deal of time and resources. It also allows employees to be more productive on their tasks.

2. To improve visibility
BI also helps improve the visibility of business processes and makes it possible to identify any areas that need attention.

3. Fix Accountability
A BI system helps fix accountability in the organization: someone must own responsibility for the organization's performance against its set goals.

4. It gives a bird’s eye view:


A BI system also gives decision-makers an overall bird's eye view of the organization through typical BI features such as dashboards and scorecards.

5. It streamlines business processes:


BI removes much of the complexity associated with business processes. It also automates analytics by offering predictive analysis, computer modeling, benchmarking, and other methodologies.

6. It allows for easy analytics.


BI software has been democratized, allowing even non-technical or non-analyst users to collect and process data quickly. This puts the power of analytics into the hands of many people.

8.3 Business Intelligence - Disadvantages


Business Intelligence has the following disadvantages:

1. Cost:
Business intelligence can prove costly for small and medium-sized enterprises, and such a system may be too expensive for routine business transactions.

2. Complexity:
Another drawback of BI is the complexity of implementing the data warehouse. It can be so complex that it makes business processes rigid and difficult to adapt.

3. Limited use
Like many advanced technologies, BI was first developed with the purchasing power of large, wealthy firms in mind. As a result, BI systems are still not affordable for many small and medium-sized companies.

4. Time-Consuming Implementation
It can take almost a year and a half for a data warehousing system to be completely implemented, making it a time-consuming process.

8.4 Environmental Factors Affecting Business Intelligence


In order to understand how a holistic business intelligence strategy is possible, it’s first necessary to
understand the four environmental factors that can impact such a strategy.
The four areas can be broken down into two main categories: internal and external. Internal factors
are those that the company has direct control over, while external factors are out of the company’s
control but can still impact the business intelligence strategy.
The four environmental factors are:

Data:
This is the most important factor in business intelligence, as without data there is nothing to
analyze or report on. Data can come from a variety of sources, both internal and external to the
organization.


Data Characteristics

Business intelligence thrives on data. Without data, there is nothing to analyze or report on. Data
can come from a variety of sources, both internal and external to the organization. Internal data
sources can include things like transaction data, customer data, financial data, and operational data.
External data sources can include public records, social media data, market research data, and
competitor data.
The data gathering process must be designed to collect the right data from the right sources. Once
the data is gathered, it must then be cleaned and standardized so that it can be properly analyzed.
In the BI environment, data is king. All other factors must be aligned in order to support the data
and help it reach its full potential.
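
A minimal R sketch of the gathering and standardizing step described above is shown below; the file names, column names, and date formats are assumptions made for illustration:

# Minimal sketch; file names, column names, and date formats are hypothetical
internal <- read.csv("transactions.csv")       # internal source, e.g. transaction data
external <- read.csv("market_research.csv")    # external source, e.g. market research data

# Standardize column names and date formats so the sources can be combined
names(internal) <- tolower(names(internal))
names(external) <- tolower(names(external))
internal$date <- as.Date(internal$date, format = "%Y-%m-%d")
external$date <- as.Date(external$date, format = "%d/%m/%Y")

# Basic cleaning: drop records with missing or negative amounts
internal <- internal[!is.na(internal$amount) & internal$amount >= 0, ]

# Combine into a single dataset ready for analysis
combined <- merge(internal, external, by = "date", all.x = TRUE)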

People:
The people involved in business intelligence play a critical role in its success. From the data
analysts who gather and clean the data, to the business users who interpret and use the data to
make decisions, each person involved must have a clear understanding of their role in the process.
The data analysts need to be able to collect data from all relevant sources, clean and standardize the
data, and then load it into the BI system. The business users need to be able to access the data,
understand what it means, and use it to make decisions.
This can take many forms, but a key component is data literacy: the ability to read, work with,
analyze, and argue with data. Data literacy is essential for business users to be able to make
decisions based on data.
In a successful business intelligence environment, people are trained and empowered to use data to
make decisions.

Processes:

Database Types


Relational Database
Object-Oriented Database
Distributed Database
NoSQL Database
Graph Database
Cloud Database
Centralized Database
Operational Database
The processes used to gather, clean, and analyze data must be well-designed and efficient in order
to produce accurate and timely results.
right data from the right sources. Once the data is gathered, it must then be cleaned and
standardized so that it can be properly analyzed.
The data analysis process must be designed to answer the right questions. The results of the data analysis must be presented in a way that is easy to understand and use.

Technology:
The technology used to support business intelligence must be up to date and able to handle the volume and complexity of data. The BI system must be able to collect data from all relevant
sources, clean and standardize the data, and then load it into the system. The system must be able
to support the data analysis process and provide easy-to-use tools for business users to access and
analyze the data.

You want your BI technology to offer features such as self-service analytics, predictive analytics,
and social media integration. However, the technology must be easy enough to use that business
users don’t need a Ph.D. to use it. In a successful business intelligence environment, the technology
is easy to use and provides the features and functionality needed to support the data gathering,
analysis, and reporting process.
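
For instance, the predictive-analytics capability mentioned above often amounts to fitting a simple model to historical data. The sketch below, using made-up monthly revenue figures, fits a linear trend in R and projects the next month:

# Minimal sketch of a predictive step; monthly revenue figures are hypothetical
revenue <- data.frame(month = 1:6,
                      value = c(100, 108, 113, 121, 127, 135))

trend <- lm(value ~ month, data = revenue)        # fit a simple linear trend
predict(trend, newdata = data.frame(month = 7))   # forecast the next month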
