
Basic Statistical Techniques

Hands-on-Session with R-Studio

Tanujit Chakraborty
PhD Scholar, Indian Statistical Institute, Kolkata.
Webpage : www.ctanujit.org
Mail : [email protected]
Course Outline

Chapter  Topic                     Chapter  Topic
1        Introduction to RStudio   6        Normality Tests
2        Descriptive Statistics    7        Analysis of Variance
3        Data Visualization        8        Regression Analysis
4        Data Pre-processing       9        Binary Logistic Regression
5        Test of Hypothesis        10       Ordinal Logistic Regression


1. INTRODUCTION TO RSTUDIO



INSTALLATION

1. Download R from http://cran.r-project.org/bin/windows/base/
2. Run the R setup (exe) file and follow the instructions
3. Double-click the R icon on the desktop and the R window will open
4. Download RStudio from http://www.rstudio.com/
5. Run the RStudio setup file and follow the instructions
6. Click the RStudio icon; the RStudio IDE will load
7. Tools – Global Options – Appearance – change colour, size and theme
   (only if you wish to change the background; not a mandatory step)
8. Open a new R script (Ctrl + Shift + N)
9. Write print("Hello World!")
10. Save & run (Ctrl + Enter)

Congrats! You have written your very first R program.
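In script form, the first program can look like this (a minimal sketch; the second line is only an illustration of how R evaluates expressions):

# my_first_script.R
print("Hello World!")   # writes the text to the console
1 + 1                   # R also evaluates expressions directly: prints 2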



BASIC TASKS

Matrix multiplication – R Code
# Read the matrices A and B
A = matrix(c(21, 57, 89, 31, 7, 98), nrow = 2, ncol = 3, byrow = TRUE)
B = matrix(c(24, 35, 15, 34, 56, 25), nrow = 3, ncol = 2, byrow = TRUE)
# Multiply the matrices
C = A %*% B
C
Determinant – R Code
A = matrix(c(51, 10, 23, 64), nrow = 2, ncol = 2, byrow = TRUE)
det(A)
Matrix Inverse – R Code
A = matrix(c(51, 10, 23, 64), nrow = 2, ncol = 2, byrow = TRUE)
solve(A)
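A quick way to verify the inverse (a sketch; round() is only used to suppress floating-point noise):
A = matrix(c(51, 10, 23, 64), nrow = 2, ncol = 2, byrow = TRUE)
round(A %*% solve(A), 10)   # should print the 2 x 2 identity matrix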
Eigenvalues and Eigenvectors – R Code
A = matrix(c(1, -2, 3, -4), nrow = 2, ncol = 2, byrow = TRUE)
eigen(A)
Generating 5 Random Numbers – R Code
x = rnorm(5, mean = 0, sd = 1)
x

Function Help
?rnorm
Package Installation (note: the package is named ggplot2, not ggplot)
install.packages("ggplot2")
Library Call (for use)
library(ggplot2)
2. DESCRIPTIVE STATISTICS


Exercise 1: The monthly credit card expenses of an individual (in 1000 rupees) are given in the file Credit_Card_Expenses.csv.

a. Read the dataset into RStudio
b. Compute the mean, median, minimum, maximum, range, variance, standard deviation, skewness, kurtosis and quantiles of Credit Card Expenses
c. Compute the default summary of Credit Card Expenses
d. Draw a histogram of Credit Card Expenses


The monthly credit card expenses of an individual (in 1000 rupees) are given below.
Kindly summarize the data.

Month  Credit Card Expenses   Month  Credit Card Expenses
1      55                     11     63
2      65                     12     55
3      59                     13     61
4      59                     14     61
5      57                     15     57
6      61                     16     59
7      53                     17     61
8      63                     18     57
9      59                     19     59
10     57                     20     63
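The same data can be entered directly as a vector (a sketch; CC is the variable name used in the following slides):
CC = c(55, 65, 59, 59, 57, 61, 53, 63, 59, 57,   # months 1-10
       63, 55, 61, 61, 57, 59, 61, 57, 59, 63)   # months 11-20
summary(CC)   # quick check: mean 59.2, median 59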


Reading a csv file into RStudio (via the Import Dataset menu)

The file open dialog box will pop up.
Browse to the file and click the Import button.
RStudio will read the dataset into a data frame with the specified name.

Reading a csv file into RStudio: Source code

Credit_Card_Expenses <- read.csv("C:/Desktop/Data/Credit_Card_Expenses.csv")

To change the name of the dataset to mydata:

> mydata = Credit_Card_Expenses

To display the contents of the dataset:

> print(mydata)

To read a particular column (variable) of the dataset into a new variable, e.g. read CC_Expenses into CC:

> CC = mydata$CC_Expenses

Reading data from MS Excel formats into RStudio

Format  Code
Excel   library(xlsx)
        mydata <- read.xlsx("c:/myexcel.xlsx", sheetName = "Sheet1")

Reading data from databases into RStudio (functions from the RODBC package)

Function                              Description
odbcConnect(dsn, uid="", pwd="")      Open a connection to an ODBC database
sqlFetch(channel, sqtable)            Read a table from an ODBC database into a data frame
sqlQuery(channel, query)              Submit a query to an ODBC database and return the results
sqlSave(channel, mydf,                Write or update (append = TRUE) a data frame to a table
  tablename = sqtable, append=FALSE)  in the ODBC database
sqlDrop(channel, sqtable)             Remove a table from the ODBC database
close(channel)                        Close the connection
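A minimal end-to-end sketch with these functions (the DSN name, credentials, table name and SQL dialect below are illustrative):

> library(RODBC)
> channel <- odbcConnect("my_dsn", uid = "user", pwd = "pass")   # connect via an ODBC data source
> mydf <- sqlFetch(channel, "sales")                             # pull a whole table into a data frame
> top <- sqlQuery(channel, "SELECT * FROM sales LIMIT 10")       # or run an SQL query
> close(channel)                                                 # always close the connection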


Operators – Arithmetic

Operator   Description
+          addition
-          subtraction
*          multiplication
/          division
^ or **    exponentiation
x %% y     modulus (x mod y); 5 %% 2 is 1
x %/% y    integer division; 5 %/% 2 is 2



Operators – Relational and Logical

Operator    Description
<           less than
<=          less than or equal to
>           greater than
>=          greater than or equal to
==          exactly equal to
!=          not equal to
!x          NOT x
x | y       x OR y
x & y       x AND y
isTRUE(x)   test if x is TRUE



Descriptive Statistics
Computation of descriptive statistics for variable CC

Function Code Value


Mean > mean(CC) 59.2
Median > median(CC) 59
Standard deviation > sd(CC) 3.105174
Variance > var(CC) 9.642105
Minimum > min(CC) 53
Maximum > max(CC) 65
Range > range(CC) 53 65


Descriptive Statistics
Function Code
Quantile > quantile(CC)

Output
Quantile 0% 25% 50% 75% 100%
Value 53 57 59 61 65

Function Code

Summary >summary(CC)

Output
Minimum Q1 Median Mean Q3 Maximum
53 57 59 59.2 61 65

Descriptive Statistics

Function: describe (from the psych package)
Code:
> library(psych)
> describe(CC)

Output:
Statistic   Value
n           20
mean        59.2
sd          3.11
median      59
trimmed     59.25
mad         2.97
min         53
max         65
range       12
skew        -0.08
kurtosis    -0.85
se          0.69


Graphs

Graph                      Code
Histogram                  > hist(CC)
Histogram colour ("blue")  > hist(CC, col = "blue")
Dot plot                   > dotchart(CC)
Box plot                   > boxplot(CC)
Box plot colour            > boxplot(CC, col = "dark green")


Histogram : Variable - CC


Box plot : Variable - CC


3. DATA VISUALIZATION


With the ever-increasing volume of data, it is impossible to tell stories without visualizations. Data visualization is the art of turning numbers into useful knowledge.

Popular Data Visualization Techniques:


1. Scatter Plot
2. Histogram
3. Bar & Stacked Bar Chart
4. Box Plot
5. Area Chart
6. HeatMap
7. Correlogram


We'll use the Big_Mart_Dataset.csv data to understand how to create these visualizations.


1. Scatter Plot: It is used to see the relationship between two continuous variables. In our mart dataset, if we want to visualize the items by their cost data, we can draw a scatter plot using two continuous variables, namely Item_Visibility and Item_MRP, as shown.

Read the data and draw a simple scatter plot using ggplot() with geom_point():
> train <- read.csv("C:/Users/ISIUSER3/Desktop/CAIML_2019/Data/Big_Mart_Dataset.csv")
> View(train)
> library(ggplot2)
> ggplot(train, aes(Item_Visibility, Item_MRP)) + geom_point() +
    scale_x_continuous("Item Visibility", breaks = seq(0, 0.35, 0.05)) +
    scale_y_continuous("Item MRP", breaks = seq(0, 270, by = 30)) + theme_bw()

1. Scatter Plot: Now we can view a third variable in the same chart, say a categorical variable (Item_Type), which gives the item type of each data point. The different categories are depicted with different colours for item_type in the chart below.

Another scatter plot using ggplot() with geom_point():
> ggplot(train, aes(Item_Visibility, Item_MRP)) + geom_point(aes(color = Item_Type)) +
    scale_x_continuous("Item Visibility", breaks = seq(0, 0.35, 0.05)) +
    scale_y_continuous("Item MRP", breaks = seq(0, 270, by = 30)) +
    theme_bw() + labs(title = "Scatterplot")


1. Scatter Plot: We can make it even clearer by creating separate scatter plots for each Item_Type, as shown below.

Another scatter plot using ggplot() with geom_point() and facet_wrap():
> library(ggplot2)
> ggplot(train, aes(Item_Visibility, Item_MRP)) + geom_point(aes(color = Item_Type)) +
    scale_x_continuous("Item Visibility", breaks = seq(0, 0.35, 0.05)) +
    scale_y_continuous("Item MRP", breaks = seq(0, 270, by = 30)) +
    theme_bw() + labs(title = "Scatterplot") + facet_wrap(~ Item_Type)

Here, facet_wrap() wraps the Item_Type panels in a rectangular layout.



2. Histogram: It is used to plot a continuous variable. It breaks the data into bins and shows the frequency distribution of these bins. We can always change the bin size and see its effect on the visualization.

For the Big_Mart_Dataset, if we want to know the count of items by their cost, we can plot a histogram of the continuous variable Item_MRP as shown below.

Histogram using ggplot() with geom_histogram():
> ggplot(train, aes(Item_MRP)) + geom_histogram(binwidth = 2) +
    scale_x_continuous("Item MRP", breaks = seq(0, 270, by = 30)) +
    scale_y_continuous("Count", breaks = seq(0, 200, by = 20)) + labs(title = "Histogram")

3. Bar Chart: It is used when you want to plot a categorical variable, or a combination of a continuous and a categorical variable.

For the Big_Mart_Dataset, suppose we want to see item weights (continuous variable) by item type (categorical variable) in a single bar chart, as shown below.

Vertical bar plot using ggplot():
> ggplot(train, aes(Item_Type, Item_Weight)) + geom_bar(stat = "identity", fill = "darkblue") +
    scale_x_discrete("Item Type") +
    scale_y_continuous("Item Weight", breaks = seq(0, 15000, by = 500)) +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) + labs(title = "Bar Chart")


3. Stacked Bar Chart: It is an advanced version of the bar chart, used for visualizing a combination of categorical variables.

For the Big_Mart_Dataset, if we want to know the count of outlets by two categorical variables, outlet type (Outlet_Type) and location (Outlet_Location_Type), a stacked chart visualizes the scenario most usefully.

Stacked bar chart using ggplot():
> ggplot(train, aes(Outlet_Location_Type, fill = Outlet_Type)) + geom_bar() +
    labs(title = "Stacked Bar Chart", x = "Outlet Location Type", y = "Count of Outlets")


4. Box Plot: It is used to plot a combination of categorical and continuous variables. This plot is useful for visualizing the spread of the data and detecting outliers. It shows five summary numbers: the minimum, the 25th percentile, the median, the 75th percentile and the maximum.

For the Big_Mart_Dataset, if we want to see each outlet's item sales in detail, including the minimum, maximum and median, a box plot is helpful. In addition, it shows the outlying item-sales values for each outlet, as in the chart below.

R Code:
> ggplot(train, aes(Outlet_Identifier, Item_Outlet_Sales)) + geom_boxplot(fill = "red") +
    scale_y_continuous("Item Outlet Sales", breaks = seq(0, 15000, by = 500)) +
    labs(title = "Box Plot", x = "Outlet Identifier")


The black points are outliers. Outlier detection and removal is an essential step of
successful data exploration.

5. Area Chart: It is used to show continuity across a variable or dataset. It is very similar to a line chart and is commonly used for time series plots. Alternatively, it can be used to plot continuous variables and analyse the underlying trends.

For the Big_Mart_Dataset, to analyse the trend of item outlet sales, an area chart can be plotted as shown below. It shows the count of outlets by sales.

R Code:
> ggplot(train, aes(Item_Outlet_Sales)) + geom_area(stat = "bin", bins = 30, fill = "steelblue") +
    scale_x_continuous(breaks = seq(0, 11000, 1000)) +
    labs(title = "Area Chart", x = "Item Outlet Sales", y = "Count")


Area chart shows continuity of Item Outlet Sales using function ggplot() with
geom_area.

6. Heat Map: It uses the intensity (shade) of colour to display the relationship between two, three or more variables in a two-dimensional image.

For the Big_Mart_Dataset, if we want to know the cost of each item at every outlet, we can plot a heatmap as shown below using three variables from our mart dataset: Item_MRP, Outlet_Identifier and Item_Type.

R Code:
> ggplot(train, aes(Outlet_Identifier, Item_Type)) + geom_raster(aes(fill = Item_MRP)) +
    labs(title = "Heat Map", x = "Outlet Identifier", y = "Item Type") +
    scale_fill_continuous(name = "Item MRP")


The dark portions indicate an Item MRP close to 50; the brighter portions indicate an Item MRP close to 250.

7. Correlogram: It is used to examine the level of correlation among the variables in the dataset. The cells of the matrix can be shaded or coloured to show the correlation value.

For the Big_Mart_Dataset, check the correlation between item cost, weight and visibility, along with outlet establishment year and outlet sales, from the plot below.

R code for a simple correlogram using corrgram():
> install.packages("corrgram")
> library(corrgram)
> corrgram(train, order = NULL, panel = panel.shade, text.panel = panel.txt, main = "Correlogram")

• The darker the colour, the higher the correlation between variables. Positive correlations are displayed in blue and negative correlations in red; the colour intensity is proportional to the correlation value.
• We can see that item cost and outlet sales are positively correlated, while item weight and item visibility are negatively correlated.

4. DATA PRE-PROCESSING


1. Missing value replenishment


2. Merging data files
3. Appending the data files
4. Transformation or normalization
5. Random Sampling

Missing Value Handling

Example: Suppose a telecom company wants to analyze the performance of its circles based on the following parameters:
1. Current month's usage
2. Last 3 months' usage
3. Average recharge
4. Projected growth
The dataset is given below (Missing_Values_Telecom data).


Example: Circle-wise Data (blank cells are missing values)

SL No.  Current Month's  Last 3 Months'  Average   Projected  Circle
        Usage            Usage           Recharge  Growth
1       5.1              3.5             99.4      99.2       A
2       4.9              3               98.6      99.2       A
3                        3.2                       99.2       A
4       4.6              3.1             98.5      9..2       A
5       5                                98.4      99.2       A
6       5.4              3.9             98.3      99.4       A
7       7                3.2             95.3      98.4.      B
8       6.4              3.2             95.5      98.5       B
9       6.9              3.1             95.1      98.5       B
10                       2.3             96        98.3       B
11      6.5              2.8             95.4      98.5       B
12      5.7                              95.5      98.3       B
13      6.3              3.3                       98.6       B
14      6.7              3.3             94.3      97.5       C
15      6.7              3               94.8      97.3       C
16      6.3              2.5             95        98.9       C
17                       3               94.8      98         C
18      6.2              3.4             94.6      97.3       C
19      5.9              3               94.9      98.8       C

Example: Read the data and variables into R

> mydata = Missing_Values_Telecom


> cmusage = mydata[,2]
> l3musage = mydata[,3]
> avrecharge = mydata[,4]


Option 1: Discard all records with missing values

> newdata = na.omit(mydata)
> write.csv(newdata, "E:/ISI/newdata.csv")
SL.No. Current.Month.s.Usage Last.3.Month.s.Usage Average.Recharge Projected.Growth Circle
1 5.1 3.5 99.4 99.2 A
2 4.9 3 98.6 99.2 A
4 4.6 3.1 98.5 9..2 A
6 5.4 3.9 98.3 99.4 A
7 7 3.2 95.3 98.4. B
8 6.4 3.2 95.5 98.5 B
9 6.9 3.1 95.1 98.5 B
11 6.5 2.8 95.4 98.5 B
14 6.7 3.3 94.3 97.5 C
15 6.7 3 94.8 97.3 C
16 6.3 2.5 95 98.9 C
18 6.2 3.4 94.6 97.3 C
19 5.9 3 94.9 98.8 C


Option 2: Replace the missing values with the variable mean, median, etc.

Replacing the missing values with the mean

Compute the means excluding the missing values:
> cmusage_mean = mean(cmusage, na.rm = TRUE)
> l3musage_mean = mean(l3musage, na.rm = TRUE)
> avrecharge_mean = mean(avrecharge, na.rm = TRUE)

Replace the missing values with the mean:
> cmusage[is.na(cmusage)] = cmusage_mean
> l3musage[is.na(l3musage)] = l3musage_mean
> avrecharge[is.na(avrecharge)] = avrecharge_mean


Making the new file


> mynewdata = cbind(cmusage, l3musage, avrecharge, mydata[,5],mydata[,6])
> write.csv(mynewdata, "E:/ISI/mynewdata.csv")


Option 2 result: data after replacing the missing values with the mean

SL No  cmusage  l3musage     avrecharge   Proj Growth  Circle
1      5.1      3.5          99.4         11           1
2      4.9      3            98.6         11           1
3      5.975    3.2          96.14117647  11           1
4      4.6      3.1          98.5         1            1
5      5        3.105882353  98.4         11           1
6      5.4      3.9          98.3         12           1
7      7        3.2          95.3         6            2
8      6.4      3.2          95.5         7            2
9      6.9      3.1          95.1         7            2
10     5.975    2.3          96           5            2
11     6.5      2.8          95.4         7            2
12     5.7      3.105882353  95.5         5            2
13     6.3      3.3          96.14117647  8            2
14     6.7      3.3          94.3         3            3
15     6.7      3            94.8         2            3
16     6.3      2.5          95           10           3
17     5.975    3            94.8         4            3
18     6.2      3.4          94.6         2            3
19     5.9      3            94.9         9            3

(Note that cbind has coerced the Projected Growth and Circle columns to their factor codes, e.g. Circle A/B/C became 1/2/3.)


TRANSFORMATION / NORMALIZATION

z transform:
Transformed data = (Data − Mean) / SD

Exercise: Normalize the variables in Supply_Chain.csv.

Read the file:
> mydata = Supply_Chain
> mydata = mydata[,2:7]

Normalize or standardize the variables:
> mystddata = scale(mydata)
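A quick check that the standardization worked (a sketch):
> round(colMeans(mystddata), 10)   # every column mean should be 0
> apply(mystddata, 2, sd)          # every column SD should be 1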


RANDOM SAMPLING

Example: Take a random sample of size 60 (10%) from the data given in the file bank-data.csv and save it as a new csv file.

Read the file:
> mydata = `bank-data`    # backticks are needed because of the hyphen in the name

> mysample = mydata[sample(1:nrow(mydata), 60, replace = FALSE), ]

> write.csv(mysample, "E:/ISI/mysample.csv")


Example: Split the data given in the file bank-data.csv randomly into two sets, training (75%) and test (25%).

Read the file:
> mydata = `bank-data`

> sample = sample(2, nrow(mydata), replace = TRUE, prob = c(0.75, 0.25))

> sample1 = mydata[sample == 1, ]
> sample2 = mydata[sample == 2, ]
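For a reproducible split, fix the random seed before sampling (a sketch; the seed value 123 is arbitrary):
> set.seed(123)
> sample = sample(2, nrow(mydata), replace = TRUE, prob = c(0.75, 0.25))
> table(sample) / nrow(mydata)   # check the realized 75/25 proportions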


5. TEST OF HYPOTHESIS


Introduction:
In many situations, it is required to accept or reject a statement or claim about some parameter.

Example:
1. The average cycle time is less than 24 hours
2. The % rejection is only 1%

The statement is called the hypothesis. The procedure for making a decision about the hypothesis is called hypothesis testing.

Advantages:
1. Handles uncertainty in decision making
2. Minimizes subjectivity in decision making
3. Helps to validate assumptions or verify conclusions

Commonly used hypothesis tests on mean of normal distribution:

• Checking mean equal to a specified value (mu = mu0)


• Two means are equal or not (mu1 = mu2)


Null Hypothesis:
A statement about the status quo
One of no difference or no effect
Denoted by H0

Alternative Hypothesis:
One in which some difference or effect is expected
Denoted by H1


Methodology demo: To Test Mean = Specified Value (mu = mu0)

Suppose we want to test whether the mean of a process characteristic is 5, based on the following sample data from the process:
4  4  5  5  6
5  4.5  6.5  6  5.5

Calculate the mean of the sample, xbar = 5.15

Compare xbar with the specified value 5,
i.e. compare xbar − specified value = xbar − 5 with 0.
If xbar − 5 is close to 0, then conclude mean = 5, else mean ≠ 5.

Methodology demo: To Test Mean = Specified Value (mu = mu0)

Consider another set of sample data. Check whether the mean of the process characteristic is 500:
400  400  500  500  600
500  450  650  600  550

Mean of the sample, xbar = 515
xbar − 500 = 515 − 500 = 15

Can we conclude mean ≠ 500?

Conclusion:
It is difficult to judge whether mean = specified value by looking at xbar − specified value alone.

Methodology demo: To Test Mean = Specified Value (mu = mu0)

The test statistic is calculated by dividing (xbar − specified value) by a function of the standard deviation.

To test Mean = Specified value:

Test statistic t0 = (xbar − specified value) / (SD / √n)

If the test statistic is close to 0, conclude that Mean = Specified value.

To check whether the test statistic is close to 0, find the p value from the sampling distribution of the test statistic.

Methodology demo: To Test Mean = Specified Value

P value:
The probability that such evidence, or a more extreme result, will occur when H0 is true.
Based on the reference distribution of the test statistic.
The tail area beyond the value of the test statistic in the reference distribution.

[Figure: reference distribution of the test statistic, with the tail area beyond t0 shaded as the P value]
Methodology demo: To Test Mean = Specified Value

[Figure: reference distribution of the test statistic t0, with the tail area beyond t0 shaded as the P value]

If the test statistic t0 is close to 0, then p will be high.
If the test statistic t0 is not close to 0, then p will be small.
If p is small, p < 0.05 (with alpha = 0.05), conclude that t ≠ 0, hence
Mean ≠ Specified Value, and H0 is rejected.

To Test Mean = Specified Value (mu = mu0)

Example: Suppose we want to test whether the mean of the process characteristic is 5, based on the following sample data:
4  4  5  5  6
5  4.5  6.5  6  5.5

H0: Mean = 5
H1: Mean ≠ 5

Calculate xbar = 5.15
SD = 0.8515
n = 10
Test statistic t0 = (xbar − 5) / (SD / √n) = (5.15 − 5) / (0.8515 / √10) = 0.5571
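The same test in R (a sketch using the ten sample values above; t.test() reproduces this t0 and the p value quoted on the next slide):
> x = c(4, 4, 5, 5, 6, 5, 4.5, 6.5, 6, 5.5)
> t.test(x, mu = 5)   # two-sided one-sample t test: t = 0.5571, df = 9, p ≈ 0.59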


Example: To Test Mean = Specified Value (mu = mu0)

t0 = 0.5571, P = 0.59

[Figure: reference t distribution with the tail area beyond t0 = 0.5571 shaded, giving P = 0.59]

Since P ≥ 0.05, we cannot conclude Mean ≠ 5.
H0: Mean = 5 is not rejected.


Hypothesis Testing: Steps


1. Formulate the null hypothesis H0 and the alternative hypothesis H1
2. Select an appropriate statistical test and the corresponding test statistic
3. Choose level of significance alpha (generally taken as 0.05)
4. Collect data and calculate the value of test statistic
5. Determine the probability associated with the test statistic under the null
hypothesis using sampling distribution of the test statistic
6. Compare the probability associated with the test statistic with level of
significance specified


One sample t test

Exercise 1: A company claims that on average it takes only 40 hours or less to process any purchase order. Can you validate the claim from the data given in PO_Processing.csv?

Reading the data:
> mydata = PO_Processing$Processing_Time

Performing the one sample t test (H0: mean = 40, H1: mean > 40):
> t.test(mydata, alternative = 'greater', mu = 40)

Statistic   Value
t           3.7031
df          99
P value     0.0001753

Since the p value < 0.05, H0 is rejected: the mean processing time is greater than 40 hours, so the claim is not supported by the data.


6. NORMALITY TEST


Normality test:
A methodology to check whether the characteristic under study is normally distributed or not.
Two methods:

Normality test – Quantile–Quantile (Q–Q) plot
Plots the ranked samples from the given distribution against a similar number of ranked quantiles taken from a normal distribution.
If the sample is normally distributed, the points will lie along a straight line in the plot.

Normality test – Shapiro–Wilk test
H0: Deviation from bell shape (normality) = 0
H1: Deviation from bell shape ≠ 0
If p value ≥ 0.05 (5%), then H0 is not rejected and the distribution is normal.

Normality test

Exercise 1: The processing times of purchase orders are given in PO_Processing.csv. Is the distribution of processing time normal?

Reading the data and variable:
> mydata = PO_Processing
> PT = mydata$Processing_Time

Normality check using the normal Q–Q plot:
> qqnorm(PT)
> qqline(PT)


Normality check using the Shapiro–Wilk test:
> shapiro.test(PT)

Statistic   Value
W           0.9804
p value     0.1418

Conclusion: since the p value (0.1418) is above 0.05, normality is not rejected and the data can be treated as normal.



7. ANALYSIS OF VARIANCE


ANOVA

Analysis of Variance is a test of means for two or more populations.
It partitions the total variability in the variable under study into different components.
H0: Mean1 = Mean2 = ⋯ = Meank
Reject H0 if p value < 0.05.

Example:
To study the effect of shelf location on sales revenue.

One Way Anova: Example

An electronics and home appliance chain suspects that the location of the shelves where television sets are kept influences the sales revenue. The data on sales revenue (in lakhs) from television sets kept at different locations inside the store are given in the sales revenue data file; the location is coded as 1: front, 2: middle and 3: rear. Verify the suspicion. The data is given in Sales_Revenue_Anova.csv.


One Way Anova : Example

Factor: Location(A)
Levels : front, middle, rear
Response: Sales revenue


One Way Anova: Example

Step 1: Calculate the sum, average and number of response values for each level of the factor (location).

Level 1 sum (A1):
Sum of all response values when location is at level 1 (front)
= 1.55 + 2.36 + 1.84 + 1.72
= 7.47
nA1: number of response values when location is at level 1 (front)
= 4


One Way Anova: Example (Step 1 continued)

Level 1 average:
= sum of all response values when location is at level 1 / number of response values at level 1
= A1 / nA1 = 7.47 / 4 = 1.87


One Way Anova: Example

Step 1 summary: sum, number and average of response values at each level of location.

          Level 1 (front)   Level 2 (middle)   Level 3 (rear)
Sum       A1: 7.47          A2: 30.31          A3: 15.55
Number    nA1: 4            nA2: 8             nA3: 6
Average   1.87              3.79               2.59


One Way Anova: Example

Step 2: Calculate the grand total (T)
T = sum of all the response values
  = 1.55 + 2.36 + ⋯ + 2.72 + 2.07 = 53.33

Step 3: Calculate the total number of response values (N)
N = 18

Step 4: Calculate the correction factor (CF)
CF = (Grand Total)² / Number of response values
   = T² / N = (53.33)² / 18 = 158.0049


One Way Anova: Example

Step 5: Calculate the Total Sum of Squares (TSS)
TSS = sum of squares of all the response values − CF
    = 1.55² + 2.36² + ⋯ + 2.72² + 2.07² − 158.0049
    = 15.2182


One Way Anova: Example

Step 6: Calculate the between (factor) sum of squares
SSA = A1² / nA1 + A2² / nA2 + A3² / nA3 − CF
    = 7.47² / 4 + 30.31² / 8 + 15.55² / 6 − 158.0049
    = 11.0827

Step 7: Calculate the within (error) sum of squares
SSe = total sum of squares − between sum of squares
    = TSS − SSA = 15.2182 − 11.0827 = 4.1354
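These hand calculations can be reproduced in R from the level summaries alone (a sketch using only the numbers above):
> A = c(7.47, 30.31, 15.55)    # level sums A1, A2, A3
> n = c(4, 8, 6)               # level sizes nA1, nA2, nA3
> CF = sum(A)^2 / sum(n)       # correction factor: 158.0049
> SSA = sum(A^2 / n) - CF      # between sum of squares: 11.0827
> SSe = 15.2182 - SSA          # within SS, using TSS from Step 5: 4.1354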


One Way Anova: Example

Step 8: Calculate the degrees of freedom (df)
Total df   = total number of response values − 1 = 18 − 1 = 17
Between df = number of levels of the factor − 1  = 3 − 1 = 2
Within df  = total df − between df               = 17 − 2 = 15


One Way Anova: Example

Anova Table:

Source    df   SS         MS         F          F Crit   P value
Between   2    11.08272   5.541358   20.09949   3.68     0.0000
Within    15   4.135446   0.275696
Total     17   15.21816

MS = SS / df
F = MSBetween / MSWithin

F Crit  = finv(probability, between df, within df), probability = 0.05   (Excel function)
P value = fdist(F, between df, within df)                                (Excel function)
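The same critical value and p value can be obtained in R (a sketch; qf() and pf() are the R counterparts of Excel's FINV and FDIST):
> qf(0.95, 2, 15)                          # F critical value at alpha = 0.05: 3.68
> pf(20.09949, 2, 15, lower.tail = FALSE)  # p value for the observed F statistic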

One Way Anova: R Code

Reading data and variables into R:
> mydata = Sales_Revenue_Anova
> location = mydata$Location
> Revenue = mydata$Sales.Revenue

Converting location to a factor:
> location = factor(location)

Computing the ANOVA table:
> fit = aov(Revenue ~ location)
> summary(fit)


One Way Anova: Decision Rule

If p value < 0.05, then the factor has a significant effect on the process output or response.

Meaning: when the factor is changed from one level to another, there will be a significant change in the response.

One Way Anova: Example Result

For factor Location, p = 0.000 < 0.05.

Conclusion: location has a significant effect on sales revenue.

Meaning: the sales revenue is not the same for the different locations (front, middle and rear).

One Way Anova: Example Result

The expected sales revenue for each location under study equals the level average.

Location   Expected Sales Revenue
Front      1.8675
Middle     3.78875
Rear       2.591667

> aggregate(Revenue ~ location, FUN = mean)


One Way Anova : Example Result

> boxplot(Revenue ~ location)


One Way Anova : Example Result

> library(gplots)
> plotmeans(Revenue ~ location)


One Way Anova: Tukey's Honestly Significant Difference (HSD) Test

Used for pairwise comparisons between the levels of the factor.

R code:
> TukeyHSD(fit)

Comparison   Mean difference   Lower      Upper      p value
2-1          1.92125           1.086067   2.756433   0.0000
3-1          0.724167          -0.15619   1.604527   0.1158
3-2          -1.19708          -1.93365   -0.46052   0.0020

Only the middle–front (2-1) and rear–middle (3-2) differences are significant (p < 0.05); rear and front (3-1) do not differ significantly.


Anova logic:

Two types of variation:
1. Variation within the levels of a factor
2. Variation between the levels of a factor


Anova logic:
Variation between the levels of a factor: the effect of the factor.
Variation within the levels of a factor: the inherent variation in the process, or process error.

Sales Revenue by Location:

Front   Middle   Rear
1.34    3.20     2.30
1.89    2.81     1.91
1.35    4.52     1.40
2.07    4.40     1.48
        2.41     4.75
        3.06     5.19
        3.42
        9.80


Anova logic:

If the variation between the levels of a factor is significantly higher than the inherent variation, then the factor has a significant effect on the response.

To check whether a factor is significant: compare the variation between levels with the variation within levels.


Anova logic:

Measure of variation between levels: MS of the factor (MSbetween)
Measure of variation within levels: MS error (MSwithin)

To check whether a factor is significant, compare MS between with MS within,
i.e. calculate F = MSbetween / MSwithin.
If F is very high, then the factor is significant.


Variation within levels:

Ideally the variation within all the levels should be the same.
To check whether the within-level variations are equal, do Bartlett's test:
if p value ≥ 0.05, then the within-level variances are equal, otherwise not.

R code for Bartlett's test:
> bartlett.test(Revenue ~ location, data = mydata)


Variation within levels:

Bartlett's test result for the sales revenue (location of TV sets) example:

Bartlett's K² statistic   df   p value
3.8325                    2    0.1472

Since p value = 0.1472 > 0.05, the variances within the levels are equal.


8. REGRESSION ANALYSIS


Regression

Correlation helps:
• To check whether two variables are related
• If related, to identify the type and degree of the relationship

Regression helps:
• To identify the exact form of the relationship
• To model the output in terms of input or process variables


Exercise 1: Data from a pulp drying process is given in the file DC_Simple_Reg.csv. The file contains the dry content achieved at different dryer temperatures. Develop a prediction model for dry content in terms of dryer temperature.

1. Reading the data and variables
> mydata = DC_Simple_Reg
> Temp = mydata$Dryer.Temperature
> DContent = mydata$Dry.Content


2. Constructing Scatter Plot


> plot(Temp, DContent)


3. Computing the Correlation
> cor(Temp, DContent)

Attribute     Dry Content
Temperature   0.9992

Remark: the correlation between y and x needs to be high (preferably 0.8 to 1 or −0.8 to −1.0).


4: Performing Regression
> model = lm(DContent ~ Temp)
> summary(model)

Statistic                 Value     Criteria
Residual standard error   0.07059
Multiple R-squared        0.9984    > 0.6
Adjusted R-squared        0.9983    > 0.6

Model df: Regression 1, Residual 40, Total 41; F = 24497, p value = 0.000

Criteria: p value < 0.05


4: Performing Regression

Attribute     Coefficient   Std. Error   t Statistic   p value
Intercept     2.183813      0.463589     4.711         0.00
Temperature   1.293432      0.008264     156.518       0.00

Interpretation: the p value for the independent variable needs to be < the significance level α (generally α = 0.05).

Model: Dry Content = 2.183813 + 1.293432 × Temperature
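Once fitted, the model can be used for prediction (a sketch; the temperature value 55 is purely illustrative):
> predict(model, newdata = data.frame(Temp = 55))                            # point prediction of dry content
> predict(model, newdata = data.frame(Temp = 55), interval = "prediction")   # adds a 95% prediction interval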


5: Regression Anova
> anova(model)

Source     SS        df   MS        F       p value
Temp       122.057   1    122.057   24497   0.000
Residual   0.199     40   0.005
Total      122.256   41

Criteria: p value < 0.05


5: Residual Analysis
> pred = fitted(model)
> Res = residuals(model)
> write.csv(pred, "D:/ISI/DataSets/Pred.csv")
> write.csv(Res, "D:/ISI/DataSets/Res.csv")

SL No.  Fitted     Residuals   SL No.  Fitted     Residuals
1       73.32259   -0.02259    22      74.61602   -0.01602
2       74.61602   -0.01602    23      75.26274   -0.06274
3       73.96931   0.030693    24      73.96931   0.030693
4       78.49632   0.00368     25      75.90946   -0.00946
5       74.61602   -0.01602    26      75.26274   0.03726
6       73.96931   0.030693    27      73.96931   0.030693
7       75.26274   -0.06274    28      78.49632   0.00368
8       77.20289   -0.00289    29      76.55617   -0.05617
9       75.90946   -0.00946    30      74.61602   -0.11602
10      74.61602   -0.01602    31      75.90946   0.090544
11      73.32259   -0.02259    32      76.55617   -0.05617
12      75.90946   -0.00946    33      76.55617   0.143828
13      75.90946   0.090544    34      75.90946   0.090544
14      74.61602   -0.01602    35      75.90946   -0.10946
15      74.61602   0.083977    36      73.96931   -0.16931
16      74.61602   -0.11602    37      73.32259   -0.02259
17      70.73573   -0.03573    38      74.61602   -0.01602
18      72.02916   -0.02916    39      73.32259   0.077409
19      72.02916   0.070841    40      75.90946   0.090544
20      72.02916   0.170841    41      73.96931   0.030693
21      70.73573   -0.03573    42      75.26274   -0.06274

5: Residual Analysis
Scatter Plot: Actual Vs Predicted (fit)
> plot(DContent, pred)


5: Residual Analysis
Normality Check on residuals
> qqnorm(Res)
> qqline(Res)

Residuals should be normally distributed or bell shaped



5: Residual Analysis
Normality check on residuals:
> shapiro.test(Res)

Shapiro–Wilk normality test:
W        p value
0.9693   0.3132

Residuals should be normally distributed (bell shaped); since p > 0.05, normality of the residuals is not rejected.


5: Residual Analysis
> plot(pred, Res)
> plot(Temp, Res)

Residuals should be independent and stable.

Plot the residuals against the fitted values. The points in the graph should be scattered randomly and should not show any trend or pattern; the residuals should not depend in any way on the fitted value. If there is a pattern, then a transformation such as log(y) or √y should be used.

Similarly, the residuals should not depend on x. This can be checked by plotting the residuals vs x. A pattern in this plot indicates that the residuals are not independent of x; in that case, develop the model with a function of x as predictor (e.g. x², 1/x, √x, log(x), etc.).


Residual Analysis

There is no trend or pattern in the residuals vs fitted values, residuals vs observation order, or residuals vs x plots. Hence the assumptions of independence and stability of the residuals are satisfied.


6: Outlier test
Observations with Bonferroni p value < 0.05 are potential outliers.
> library(car)
> outlierTest(model)

Observation   Studentized Residual   Bonferroni p value
20            2.723093               0.40417

Since the Bonferroni p value > 0.05, observation 20 is not flagged as an outlier.


7: Leave One Out Cross Validation (LOOCV)

• Split the data into two parts: training data and test data.
  The test data consists of only one observation (x1, y1); the training data consists of the remaining n − 1 observations (x2, y2), (x3, y3), …, (xn, yn).
• Develop the model using the n − 1 training observations and predict the response y1 of the test observation.
  Compute the residual and the squared error MSE1 = (y1,actual − y1,pred)².
• Repeat the process by taking (x2, y2) as the test data and the remaining n − 1 observations as training data, and compute MSE2.
• Repeating the procedure n times produces n squared errors MSE1, MSE2, …, MSEn.
• The LOOCV estimate of the test MSE is the average of these n test error estimates:

  CV(n) = (1/n) · Σᵢ₌₁ⁿ MSEᵢ

7: Leave One Out Cross Validation (LOOCV)


> library(boot)
> attach(mydata)
> mymodel = glm(Dry.Content ~ Dryer.Temperature)
> valid = cv.glm(mydata, mymodel)
> valid$delta[1]

Statistic Value
Delta 0.005201004


Multiple Linear Regression

To model an output variable y in terms of two or more variables.

General form:
y = a + b1x1 + b2x2 + ⋯ + bkxk + ε
Two-variable case:
y = a + b1x1 + b2x2 + ε
where
a: intercept (the predicted value of y when all x's are zero)
bj: slope (the change in y for a unit change in xj, keeping all other x's constant, j = 1, 2, …, k)
ε: random error term


Exercise: Temperature and reaction time affect the % yield. The data collected is given in the Mult-Reg_Yield file. Develop a model for % yield in terms of temperature and time.

Step 1: Correlation Analysis

Attribute     Time    Temperature   % Yield
Time          1.00    -0.01         0.90
Temperature   -0.01   1.00          -0.05
% Yield       0.90    -0.05         1.00

The correlations between the x's and y should be high;
the correlations among the x's should be low.
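The correlation matrix and the two-variable fit in R (a sketch; the column names X.Yield, Time and Temperature follow the usage in the later slides):
> mydata = `Mult-Reg_Yield`    # imported data frame; backticks because of the hyphen
> cor(mydata)                  # correlation matrix shown above
> mymodel = lm(X.Yield ~ Time + Temperature, data = mydata)
> summary(mymodel)             # produces the Step 2 output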

Step 2: Regression Output

Statistic           Value    Criteria
Adjusted R Square   0.7766   ≥ 0.6

Regression ANOVA
Model        SS           df   MS         F       p value
Regression   6797.063     2    3398.531   27.07   0.0000
Residual     1632.08138   13   125.5447
Total        8429.14438   15

Criteria: p value < 0.05


Step 2: Regression Output

ANOVA
Source     SS       df   MS       F         p value
Time       6777.8   1    6777.8   53.9872   0.000
Temp       19.3     1    19.3     0.1534    0.702
Residual   1632.1   13   125.5

Criteria: p value < 0.05


Step 2: Regression Output – Identify the model

Attribute     Coefficient   Std. Error   t Statistic   p value
Time          0.9061        0.12337      7.344         0.0000
Temperature   -0.0642       0.16391      -0.392        0.702
Intercept     -67.8844      40.58652     -1.67         0.118

Interpretation: only Time is related to % yield, as its p value < 0.05; Temperature is dropped and the model is refitted below.


Step 2: Regression Output – Identify the model (refitted with Time only)

Attribute   Coefficient   Std. Error   t Statistic   p value
Time        0.9065        0.1196       7.580         0.0000
Intercept   -81.6205      19.7906      -4.124        0.00103

Model: % Yield = 0.9065 × Time − 81.621


Step 3: Residual Analysis


SL No. Temperature % Yield Predicted Time
1 190 35.0 36.22 130
2 176 81.7 76.10 174
3 205 42.5 39.84 134
4 210 98.3 91.51 191
5 230 52.7 67.94 165
6 192 82.0 94.23 194
7 220 34.5 48.00 143
8 235 95.4 86.98 186
9 240 56.7 44.38 139
10 230 84.4 88.79 188
11 200 94.3 77.01 175
12 218 44.3 59.79 156
13 220 83.3 90.61 190
14 210 91.4 79.73 178
15 208 43.5 38.03 132
16 225 51.7 52.53 148

Step 3: Residual Analysis: Outlier detection

SL No. Temperature % Yield Predicted Time Residuals Std Residuals

1 190 35 36.22 130 -1.22 -0.126


2 176 81.7 76.1 174 5.60 0.5358
3 205 42.5 39.84 134 2.66 0.2686
4 210 98.3 91.51 191 6.79 0.6784
5 230 52.7 67.94 165 -15.24 -1.45
6 192 82 94.23 194 -12.23 -1.238
7 220 34.5 48 143 -13.50 -1.322
8 235 95.4 86.98 186 8.42 0.8272
9 240 56.7 44.38 139 12.32 1.2221
10 230 84.4 88.79 188 -4.39 -0.434
11 200 94.3 77.01 175 17.29 1.6575
12 218 44.3 59.79 156 -15.49 -1.479
13 220 83.3 90.61 190 -7.31 -0.727
14 210 91.4 79.73 178 11.67 1.1244
15 208 43.5 38.03 132 5.47 0.5582
16 225 51.7 52.53 148 -0.83 -0.081
Mean 0.000
SD 10.4918


Step 3: Residual Analysis

Shapiro–Wilk normality test on the residuals of the yield model:
W        p value
0.9449   0.4132

Since p > 0.05, the residuals can be treated as normal.


Step 4: Outlier test
Observations with Bonferroni p value < 0.05 are potential outliers.
> library(car)
> outlierTest(mymodel)

Observation   Studentized Residual   Bonferroni p value
11            1.781515               NA

(An NA Bonferroni p value means the adjusted p value exceeded 1; observation 11 is not flagged as an outlier.)


Step 5: Leave One Out Cross Validation (LOOCV)

> library(boot)
> attach(mydata)
> mymodel = glm(X.Yield ~ Time)
> myvalidation = cv.glm(mydata, mymodel)
> myvalidation$delta[1]

Statistic   Value
Delta       128.8541


Exercise: Temperature, time and the kappa number of the pulp affect the % conversion of UB pulp to Cl2 pulp. The data collected is given in the Mult_Reg_Conversion file. Develop a model for % conversion in terms of the explanatory variables.


Step 1: Correlation Analysis

               Temperature   Time    Kappa #   % Conversion
Temperature    1.00          -0.96   0.22      0.95
Time           -0.96         1.00    -0.24     -0.91
Kappa #        0.22          -0.24   1.00      0.37
% Conversion   0.95          -0.91   0.37      1.00

Interpretation:
High correlation between % conversion and temperature & time.
High correlation between temperature & time – multicollinearity.


Measure for Multicollinearity

Variance Inflation Factor (VIF):
Measures the correlation (linear association) between each x variable and the other x's.

VIFi = 1 / (1 − Ri²)

where Ri² is the coefficient of determination from regressing xi on the other x's.

Criteria: VIF > 5 indicates multicollinearity.
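A VIF can also be computed by hand from this definition (a sketch; car::vif() on the later slide does the same directly):
> r2 = summary(lm(Temperature ~ Time + Kappa.number, data = mydata))$r.squared
> 1 / (1 - r2)   # VIF for Temperature; reported as 12.23 in the output below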


Regression Output

Statistic           Value   Criteria
Adjusted R Square   0.899   > 0.6

Regression ANOVA
Model        SS         df   MS        F        p value
Regression   1953.419   3    651.140   45.885   0.0000
Residual     170.290    12   14.191
Total        2123.709   15


Regression Output

              Coeff      Std. Error   t       p value
Constant      -121.27    55.43571     -2.19   0.0492
Temperature   0.12685    0.04218      3.007   0.0109
Time          -19.0217   107.92824    -0.18   0.863
Kappa #       0.34816    0.17702      1.967   0.0728

Variance inflation factors (VIF):
> vif(mymodel)

x             VIF
Temperature   12.23
Time          12.33
Kappa #       1.062

Tackling Multicollinearity:
1. Remove one or more of highly correlated independent variable
2. Principal Component Regression
3. Partial Least Square Regression
4. Ridge Regression


Tackling Multicollinearity:

Method 1: Removing highly correlated variables – Stepwise Regression

Approach:
• A null model is developed without any predictor variable x; in the null model, the predicted value is the overall mean of y.
• Predictor variables x are then added to the model sequentially.
• After adding each new variable, the method also removes any variable that no longer provides an improvement in the model fit.
• Finally, the best model is identified as the one which minimizes the Akaike information criterion (AIC):

  AIC = (1 / (n·σ̂²)) · (RSS + 2·d·σ̂²)


Tackling Multicollinearity:

Method 1: Removing highly correlated variables – Stepwise Regression

Akaike information criterion (AIC):

  AIC = (1 / (n·σ̂²)) · (RSS + 2·d·σ̂²)

n: number of observations
σ̂²: estimate of the error (residual) variance
d: number of x variables included in the model
RSS: residual sum of squares


Tackling Multicollinearity:

Method 1: Removing highly correlated variables – Stepwise Regression

R code:
> library(MASS)
> mymodel = lm(X..Conversion ~ Temperature + Time + Kappa.number)
> step = stepAIC(mymodel, direction = "both")

Step   x's in the model                     AIC
1      Temperature, Time & Kappa number     45.8
2      Temperature & Kappa number           43.9


Tackling Multicollinearity:

Method 1: Stepwise Regression

Attribute     Coefficient   Std. Error   t Statistic   p value
Temperature   0.13396       0.01191      11.250        0.0000
Kappa #       0.35106       0.16955      2.071         0.0589
Intercept     -130.68986    14.14571     -9.239        0.0000

% Conversion = 0.13396 × Temperature + 0.35106 × Kappa # − 130.68986

Variance inflation factors (VIF):
x             VIF
Temperature   1.0526
Kappa #       1.0526

Tackling Multicollinearity:

Method 1: Stepwise Regression

> pred = predict(mymodel)
> res = residuals(mymodel)
> cbind(X..Conversion, pred, res)
> mse = mean(res^2)
> rmse = sqrt(mse)

Statistic                       Value
Mean Square Error (MSE)         10.7
Root Mean Square Error (RMSE)   3.27


k fold Cross Validation

Steps:
1. Divide the data set into k equal subsets
2. Keep one subset (sample) for model validation
3. Develop the model using the other k − 1 subsets put together
4. Predict the responses for the test data and compute the residuals
5. Return the test sample to the original data set and take another subset for model validation
6. Go to step 3 and continue until all the subsets have been tested with different models
7. Compute the overall root mean square residual. The RMSE of validation should not be much higher than that of the original model developed with all the data points together.

Note: when k = n, k fold cross validation is the same as leave one out cross validation.


k fold Cross Validation

R code:
> library(DAAG)
> cv.lm(df = mydata, form.lm = mymodel, m = 16)

m: number of folds required. Here m = 16 = n, hence this is equal to leave one out cross validation.

Model              MSE    RMSE
Original           10.7   3.27
Cross Validation   19.6   4.43


Tackling Multicollinearity:

Method 2: Principal Component Regression

1. Perform principal component analysis on the x variables
2. Use the principal components as x variables and develop the model

R code: Principal Component Regression
> mydata = mydata[,2:5]
> attach(mydata)
> library(pls)
> mymodel = pcr(X..Conversion ~ ., data = mydata, scale = TRUE)
> summary(mymodel)
> mymodel$loadings


Tackling Multicollinearity:

Method 2: Principal Component Regression

Cum % Variance   PC1     PC2     PC3
x                68.66   98.61   100
Conversion (y)   90.48   90.62   91.98

Component 1, or components 1 & 2, may be sufficient to include in the model.


Tackling Multicollinearity:

Method 2: Principal Component Regression

Loadings       PC1      PC2      PC3
Temperature    -0.674   0.218    0.705
Time           0.677    -0.2     0.709
Kappa.number   -0.296   -0.955   0

Component 1 captures the information in Temperature and Time, while component 2 mostly represents the kappa number.


Tackling Multicollinearity:

Method 2: Principal Component Regression

Principal Component Scores


SL No. Comp 1 Comp 2 Comp 3
1 -1.079 1.2498 0.1202
2 -1.158 0.9967 0.1236
3 -1.273 0.6625 0.117
4 -1.371 0.2313 0.1563
5 -1.543 -0.362 0.1756
6 -1.889 -1.365 0.1558
7 0.4709 1.1733 -0.133
8 0.3133 0.8148 -0.173
9 0.0021 0.2622 -0.299
10 -0.257 -0.122 -0.428
11 -0.268 -0.763 -0.24
12 -0.432 -1.819 -0.07
13 2.2484 0.6246 -0.022
14 2.4329 0.165 0.2963
15 2.1218 -0.388 0.1699
16 1.6801 -1.362 0.0493

Tackling Multicollinearity:

Method 2: Principal Component Regression

Identifying the required number of components in the model:
> pred = predict(mymodel, type = "response", ncomp = 1)
> res = X..Conversion - pred
> mse = mean(res^2)
> prednew = predict(mymodel, type = "response", ncomp = 2)
> resnew = X..Conversion - prednew
> msenew = mean(resnew^2)

Statistic   Regression with PC1   Regression with PC1 & PC2
MSE         12.64226              12.45593

Since there is not much reduction in MSE from including the second principal component, only PC1 is required for modelling.


Tackling Multicollinearity:

Method 3: Partial Least Squares Regression

Principal component regression identifies the linear combinations of predictors that best represent the x variables. The response y is not used to help determine the principal components: the response does not supervise their identification. PCR therefore finds the linear combinations that best explain the predictor variables x, but these may not be the ones best for predicting the response y.

Partial least squares regression is a supervised alternative to principal component regression. The partial least squares method identifies the components or directions (linear combinations of the x variables) using the response variable y, placing the highest weight on the variables that are most strongly related to the response y.


R code
> mydata = mydata[,2:5]
> attach(mydata)
> library(pls)
> mymodel = plsr(X..Conversion ~ ., data = mydata, scale = TRUE)
> summary(mymodel)
> mymodel$loadings


Cum % Variance   PLS1    PLS2    PLS3
x                68.65   96.92   100
Conversion (y)   90.63   90.86   91.98

Loadings       PLS1     PLS2     PLS3
Temperature    0.677    0.344    0.299
Time           -0.679   -0.207   0.607
Kappa.number   0.285    -1.391   0.736



> ps = mymodel$scores
> score = ps[,1:2]
SL No PLS1 PLS2
1 1.11324 0.89634
2 1.18502 0.73368
3 1.2913 0.51027
4 1.3792 0.25877
5 1.5361 -0.1142
6 1.85493 -0.7845
7 -0.4425 0.66627
8 -0.2949 0.40157
9 -0.0005 -0.0564
10 0.24599 -0.4059
11 0.24426 -0.6809
12 0.3833 -1.24
13 -2.2314 0.4067
14 -2.4222 0.35105
15 -2.1279 -0.1069
16 -1.7138 -0.8359

Identifying the required number of components in the model


> pred = predict(mymodel, data = mydata, scale = TRUE, ncomp = 1)
> res = X..Conversion - pred
> mse = mean(res^2)

> prednew = predict(mymodel, data = mydata, scale = TRUE, ncomp = 2)
> resnew = X..Conversion - prednew
> msenew = mean(resnew^2)
Statistics   Regression with PLS1   Regression with PLS1 & PLS2
MSE          12.44252               12.13185

Since there is not much reduction in MSE by including the second
component, only PLS1 is required for modelling

Method 4: Ridge regression

In least squares regression, the coefficients $\beta_j$ of the x variables are
identified by minimizing the residual sum of squares

$$RSS = \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2$$

In ridge regression, the coefficients $\beta_j$ of the x variables are identified
by minimizing a slightly different quantity

$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 \;=\; RSS + \lambda\sum_{j=1}^{p}\beta_j^2$$

where $\lambda \ge 0$ is a tuning parameter and $\lambda\sum_{j=1}^{p}\beta_j^2$ is the
shrinkage penalty, which will be small when $\beta_1, \beta_2, \ldots, \beta_p$ are close to zero.



Ridge regression seeks coefficient estimates that fit the data well by
minimizing the RSS, while the tuning parameter λ has the effect of shrinking
the estimates β_j towards zero
The value of λ is identified through 10-fold cross validation

10 fold Cross Validation


• Divide the data set into 10 equal parts
• Develop the model using 9 parts and test it with the remaining one part
• Repeat the process 10 times to get an unbiased estimate of MSE


R Code
> library(glmnet)
> set.seed(1)
> y = mydata[,5]
> x = mydata[,2:4]
> x = as.matrix(x)

Cross Validation
> mymodel = cv.glmnet(x, y, alpha = 0)
> plot(mymodel)


Choose the λ which minimizes the mean square error

> bestlambda = mymodel$lambda.min
Best λ = 1.088771

Develop the model with the best λ and identify the coefficients

> mynewmodel = glmnet(x, y, alpha = 0)
> predict(mynewmodel, type = "coefficients", s = bestlambda)[1:4,]

Variable       Coefficients
(Intercept)    -63.0713
Temperature    0.0823
Time           -117.5048
Kappa.number   0.3268
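
To see how well the ridge fit does at the chosen λ, the training MSE can be computed from its predictions; a sketch using the x, y and bestlambda objects from the previous slides:

# Fitted values and training MSE for the ridge model at the best lambda
ridgepred = predict(mynewmodel, s = bestlambda, newx = x)
mean((y - ridgepred)^2)                  # training mean square error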

CORRELATION & REGRESSION

Regression with dummy variables

When the x's are not numeric but nominal

Each nominal or categorical variable is converted into dummy variables
Dummy variables take values 0 or 1
The number of dummy variables for one x variable equals the number of distinct
values of that variable minus 1

Example: A study was conducted to measure the effect of gender and income
on attitude towards vacation. Data was collected from 30 respondents and is
given in the Travel_dummy_reg file. Attitude towards vacation is measured on a
9-point scale. Gender is coded as male = 1 and female = 2. Income is coded as
low = 1, medium = 2 and high = 3. Develop a model for attitude towards
vacation in terms of gender and income.


Gender   Code   gender_Code
Male     1      0
Female   2      1

Income   Code   Income1   Income2
Low      1      0         0
Medium   2      1         0
High     3      0         1

Read the file and variables

> mydata = Travel_dummy_Reg
> mydata = mydata[,2:4]
> gender = mydata$Gender
> income = mydata$Income
> attitude = mydata$Attitude

Converting categorical x's to factors

> gender = factor(gender)
> income = factor(income)
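
The 0/1 dummy coding that lm() will build from these factors can be inspected directly; a minimal sketch:

# model.matrix() shows the dummy columns lm() creates from the factors
head(model.matrix(~ gender + income))    # columns gender2, income2, income3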


Regression with dummy variables – Output

> mymodel = lm(attitude ~ gender + income)
> summary(mymodel)

Multiple R2    0.8603
Adjusted R2    0.8442
F Statistic    53.37
P value        0.00

              Estimate   Std. Error   t value   p value
(Intercept)   2.4        0.3359       7.145     0.00000
gender2       -1.6       0.3359       -4.763    0.00006
income2       2.8        0.4114       6.806     0.00000
income3       4.8        0.4114       11.668    0.00000


> anova(mymodel)

            Df   Sum Sq   Mean Sq   F        p value
gender      1    19.2     19.2      22.691   0.0001
income      2    116.27   58.133    68.703   0.0000
Residuals   26   22       0.846


9. BINARY LOGISTIC REGRESSION


Used to develop models when the output or response variable y is binary


The output variable will be binary, coded as either success or failure
Models probability of success p which lies between 0 and 1
Linear model is not appropriate

$$p = \frac{e^{a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k}}{1 + e^{a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k}}$$

p: probability of success
x_i's: independent variables
a, b_1, b_2, ...: coefficients to be estimated

If estimate of p ≥ 0.5, then classified as success, otherwise as failure
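
The mapping from the linear combination a + b_1x_1 + ... + b_kx_k to a probability is the logistic (inverse-logit) function; a small sketch with illustrative values:

# The logistic transform maps any real linear predictor into (0, 1)
invlogit = function(eta) exp(eta) / (1 + exp(eta))
invlogit(c(-3, 0, 3))                    # approx. 0.047, 0.500, 0.953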


Usage: When the dependent variable (Y variable) is binary

Example: Develop a model to predict whether a family visits a vacation
resort based on the salient characteristics of the families. The data collected
from 30 households is given in Resort_Visit.csv

1. Reading the file and variables


> mydata = Resort_Visit
> visit = mydata$Resort_Visit
> income = mydata$Family_Income
> attitude = mydata$Attitude.Towards.Travel
> importance = mydata$Importance_Vacation
> size = mydata$House_Size
> age = mydata$Age._Head

2. Converting response variable to discrete


> visit = factor(visit)

3. Correlation Matrix
> cor(mydata)
Resort_Visit Family_Income Attitude_Travel Importance_Vacation House_Size Age_Head
Resort_Visit 1.00 -0.60 -0.27 -0.42 -0.59 -0.21
Family_Income -0.60 1.00 0.30 0.23 0.47 0.21
Attitude_Travel -0.27 0.30 1.00 0.19 0.15 -0.13
Importance_Vacation -0.42 0.23 0.19 1.00 0.30 0.11
House_Size -0.59 0.47 0.15 0.30 1.00 0.09
Age_Head -0.21 0.21 -0.13 0.11 0.09 1.00

Interpretation: Correlation between X variables should be low


4. Checking relation between Xs and Y


> aggregate(income ~visit, FUN = mean)
> aggregate(attitude ~visit, FUN = mean)
> aggregate(importance ~visit, FUN = mean)
> aggregate(size ~visit, FUN = mean)
> aggregate(age ~visit, FUN = mean)

Mean by Resort_Visit

Resort_Visit   Family_Income   Attitude_Travel   Importance_Vacation   House_Size   Age_Head
0              58.5200         5.4000            5.8000                4.3333       53.7333
1              41.9133         4.3333            4.0667                2.8000       50.1333

The higher the difference in means, the stronger the relation to the response variable

5. Checking relation between Xs and Y – box plot


> boxplot(income ~ visit)
> boxplot(attitude ~ visit)
> boxplot(importance ~ visit)
> boxplot(size ~ visit)
> boxplot(age ~ visit)

Box plot: Income vs Visit (figure)


6. Perform Logistic regression


> model = glm(visit ~ income + attitude + importance + size + age, family = binomial(logit))
> summary(model)

Estimate Std. Error z value Pr(>|z|)


(Intercept) 15.49503 6.68017 2.32 0.0204
Income -0.11698 0.06605 -1.771 0.0766
attitude -0.28129 0.33919 -0.829 0.4069
importance -0.46157 0.32006 -1.442 0.1493
size -0.80699 0.49314 -1.636 0.1018
age -0.07019 0.07199 -0.975 0.3295


7. Perform Logistic regression – ANOVA


> anova(model, test = 'Chisq')

             Df   Deviance   Resid. Df   Resid. Dev   Pr(>Chi)
NULL                         29          41.589
income       1    12.9813    28          28.608       0.00031
attitude     1    0.4219     27          28.186       0.51598
importance   1    3.8344     26          24.351       0.05021
size         1    3.4398     25          20.911       0.06364
age          1    1.0242     24          19.887       0.31152

Income has p value < 0.05, while Importance_Vacation and Size are borderline
(p ≈ 0.05 and 0.06); redo the modelling with the important factors only


8. Perform Logistic regression – Modified


Estimate Std Error z value p value
(Intercept) 8.46599 3.02494 2.799 0.00513
Income -0.10641 0.05156 -2.064 0.03904
Size -0.93539 0.47632 -1.964 0.04955

Since p value < 0.05 for both factors, Income & Size, the response variable
can be modelled in terms of those two factors

The model is

$$\hat{p} = \frac{e^{8.46599 \,-\, 0.10641\,\mathrm{Annual\_Income} \,-\, 0.93539\,\mathrm{Size}}}{1 + e^{8.46599 \,-\, 0.10641\,\mathrm{Annual\_Income} \,-\, 0.93539\,\mathrm{Size}}}$$
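
Plugging values into the fitted equation shows the classification rule at work; the income and size values below are hypothetical, chosen only for illustration:

# Predicted visit probability for a hypothetical family
eta = 8.46599 - 0.10641 * 45 - 0.93539 * 3    # income = 45, size = 3
exp(eta) / (1 + exp(eta))                     # approx. 0.71 >= 0.5 -> "visit"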


9. Conditional Density plots (Response vs Factors)


Describing how the conditional distribution of a categorical variable y changes
over a numerical variable x
> cdplot(visit ~ income)
> cdplot(visit ~ size)



10. Fitted Values and residuals
> predict(model, type = 'response')
> residuals(model, type = 'deviance')
> predclass = ifelse(predict(model, type = 'response') > 0.5, "1", "0")
SL No. Actual Fitted Residuals Predicted Class SL No. Actual Fitted Residuals Predicted Class

1 0 0.970979 -2.66073 1 16 1 0.904132 0.448954 1

2 0 0.059732 -0.35097 0 17 1 0.939523 0.353222 1

3 0 0.021049 -0.20627 0 18 1 0.880611 0.50426 1

4 0 0.202309 -0.67236 0 19 1 0.345537 1.457845 0

5 0 0.292461 -0.83182 0 20 1 0.724535 0.802777 1

6 0 0.014893 -0.17324 0 21 1 0.925508 0.393479 1

7 0 0.677783 -1.50501 1 22 1 0.677559 0.882337 1

8 0 0.038723 -0.28105 0 23 1 0.680103 0.878079 1

9 0 0.109432 -0.48145 0 24 1 0.516151 1.150092 1

10 0 0.030543 -0.24908 0 25 1 0.680326 0.877704 1

11 0 0.017609 -0.1885 0 26 1 0.77062 0.721887 1

12 0 0.050856 -0.32309 0 27 1 0.629425 0.962235 1

13 0 0.04202 -0.29301 0 28 1 0.954395 0.305541 1

14 0 0.601981 -1.35739 1 29 1 0.841493 0.587498 1

15 0 0.499424 -1.17643 0 30 1 0.900286 0.45835 1


11. Model Evaluation


> mytable = table(visit, predclass)
> mytable
> prop.table(mytable)
Predicted Count Total
Actual Count 0 1
0 12 3 15
1 1 14 15
Total 13 17 30

Predicted % Total
Actual % 0 1
0 40 10 50
1 3 47 50
Total 43 50 100

Statistics   Value
Accuracy %   87
Error %      13

Accuracy of ≥ 80% is considered good
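
The accuracy figure can be recomputed directly from the confusion matrix; a one-line sketch using the mytable object created above:

# Overall accuracy = correctly classified / total
sum(diag(mytable)) / sum(mytable)        # (12 + 14) / 30 = 0.867, i.e. 87%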


10. ORDINAL LOGISTIC REGRESSION


Used to develop models when the output or response variable y is ordinal

The output variable will be categorical, having more than two ordered categories
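
The model fitted by MASS::polr in the slides below is the proportional-odds (cumulative logit) model; a sketch of its form, where the ζ_j are the cut-point intercepts reported by polr:

$$\log\frac{P(Y \le j)}{1 - P(Y \le j)} = \zeta_j - (\beta_1 x_1 + \cdots + \beta_k x_k), \qquad j = 1, \ldots, J - 1$$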


Example 1: The data on system test defect density along with testing effort and
test coverage is given in ST_Defects.csv. The defect density is classified as
Low, Medium or High. Develop a model to estimate the system testing defect
density class based on testing effort and test coverage.

Read the data file and variables

> mydata = ST_Defects
> dd = mydata$DD
> effort = mydata$Effort
> coverage = mydata$Test.Coverage


Convert the output variable to a factor; polr() uses the factor's level
ordering as the category ordering (alphabetical by default, hence
High < Low < Medium in the output below)

> dd = factor(dd)
> library(MASS)
> mymodel = polr(dd ~ effort + coverage)
> summary(mymodel)

Coefficients
  effort   coverage
  0.0234   0.0257

Intercepts
  High|Low   Low|Medium
  1.4947     3.925
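
These intercepts and coefficients combine through the cumulative-logit formula; a sketch computing the three class probabilities for one observation, where effort = 50 and coverage = 80 are purely illustrative values and the level order is the alphabetical one shown above (High, Low, Medium):

# Class probabilities from the polr fit for illustrative predictor values
eta = 0.0234 * 50 + 0.0257 * 80          # linear predictor
p1 = plogis(1.4947 - eta)                # P(Y <= High), the first level
p2 = plogis(3.925 - eta)                 # P(Y <= Low), first two levels
c(p1, p2 - p1, 1 - p2)                   # P(High), P(Low), P(Medium)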

Predicted values

> pred = predict(mymodel)
> fit = fitted(mymodel)
> fit
> output = cbind(dd, pred)
> write.csv(output, "E:/Infosys/Part 2/output.csv")

Comparing Actual vs Predicted

> mytable = table(dd, pred)
> mytable
> prop.table(mytable)

                  Predicted
Actual     High   Low   Medium
High       8      42    0
Low        0      105   0
Medium     1      44    0
Comparing Actual vs Predicted (in %)

                  Predicted
Actual     High   Low    Medium
High       4.0    21.0   0.0
Low        0.0    52.5   0.0
Medium     0.5    22.0   0.0

Accuracy = 4.0 + 52.5 + 0.0 = 56.5% (i.e., a proportion of 0.565)
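
The same figure follows from the count table; a one-line sketch using mytable from the previous slide:

# Accuracy = sum of the diagonal / total count
sum(diag(mytable)) / sum(mytable)        # (8 + 105 + 0) / 200 = 0.565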

For other queries mail me at [email protected]

THANK YOU