Tanujit Chakraborty
PhD Scholar, Indian Statistical Institute, Kolkata.
Webpage : www.ctanujit.org
Mail : [email protected]
1 CAIML talk by Tanujit Chakraborty
Course Outline
4. Data Pre-processing
5. Test of Hypothesis
9. Binary Logistic Regression
10. Ordinal Logistic Regression
Functional Help
?rnorm()
Package Installation
install.packages("ggplot2")
Library Call (for use)
library(ggplot2)
2. DESCRIPTIVE STATISTICS
The monthly credit card expenses of an individual (in 1000 rupees) are given below.
Summarize the data.
Format Code
Excel library(xlsx)
mydata <- read.xlsx("c:/myexcel.xlsx", sheetName = "Sheet1")
Operators - Arithmetic
Operator Description
+ addition
- subtraction
* multiplication
/ division
^ or ** exponentiation
x %% y modulus (x mod y): 5 %% 2 is 1
x %/% y integer division: 5 %/% 2 is 2
Operators - Relational & Logical
Operator Description
< less than
<= less than or equal to
> greater than
>= greater than or equal to
== exactly equal to
!= not equal to
!x not x
x|y x OR y
x&y x AND y
isTRUE(x) test if x is TRUE
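The operators listed above can be tried directly at the R prompt; a minimal sketch with two illustrative values:

```r
# Trying the arithmetic, relational and logical operators from the tables above
x <- 5
y <- 2
x + y          # 7
x^y            # 25
x %% y         # modulus: 1
x %/% y        # integer division: 2
x > y          # TRUE
x == y         # FALSE
!(x == y)      # TRUE
isTRUE(x > y)  # TRUE
```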
Descriptive Statistics
Computation of descriptive statistics for variable CC
Function Code
Quantile > quantile(CC)
Output
Quantile 0% 25% 50% 75% 100%
Value 53 57 59 61 65
Function Code
Summary >summary(CC)
Output
Minimum Q1 Median Mean Q3 Maximum
53 57 59 59.2 61 65
Descriptive Statistics
Function: describe
Code:
> library(psych)
> describe(CC)
Output:
Statistic  Value
n          20
mean       59.2
sd         3.11
median     59
trimmed    59.25
mad        2.97
min        53
max        65
range      12
skew       -0.08
kurtosis   -0.85
se         0.69
Graphs
Graph Code
Histogram > hist(CC)
Histogram colour ("blue") > hist(CC, col="blue")
Dot plot > dotchart(CC)
Box plot > boxplot(CC)
Box plot colour > boxplot(CC, col="dark green")
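The statistics and graphs above can be reproduced end to end; a minimal sketch with illustrative values standing in for the CC data (only the min, median and max are chosen to match the slide's output, the other values are hypothetical):

```r
# Hypothetical stand-in for the monthly credit card expenses (CC, in 1000 rupees)
CC <- c(53, 55, 56, 57, 57, 58, 58, 59, 59, 59,
        59, 60, 60, 61, 61, 62, 62, 63, 64, 65)

quantile(CC)   # 0% 25% 50% 75% 100% quantiles
summary(CC)    # Min, Q1, Median, Mean, Q3, Max

hist(CC, col = "blue")           # histogram
boxplot(CC, col = "dark green")  # box plot
```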
[Figure: histogram of variable CC]
3. DATA VISUALIZATION
Read data and simple scatter plot using function ggplot() with geom_point().
> train <- read.csv("C:/Users/ISIUSER3/Desktop/CAIML_2019/Data/Big_Mart_Dataset.csv")
> View(train)
> library(ggplot2)
> ggplot(train, aes(Item_Visibility, Item_MRP)) + geom_point() +
scale_x_continuous("Item Visibility", breaks = seq(0,0.35,0.05))+
scale_y_continuous("Item MRP", breaks = seq(0,270,by = 30))+ theme_bw()
DATA VISUALIZATION
1. Scatter Plot: We can view a third variable in the same chart, say a
categorical variable (Item_Type), which gives the characteristic of each data
point. Different categories are depicted with a different colour for each
item_type in the chart below.
> library(ggplot2)
> ggplot(train, aes(Item_Visibility, Item_MRP)) + geom_point(aes(color =
Item_Type)) + scale_x_continuous("Item Visibility", breaks =
seq(0,0.35,0.05))+ scale_y_continuous("Item MRP", breaks = seq(0,270,by =
30))+ theme_bw() + labs(title="Scatterplot")
1. Scatter Plot: We can make this even clearer by creating a separate scatter
plot for each Item_Type, as shown below.
library(ggplot2)
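The per-Item_Type panels can be drawn with facet_wrap(); a sketch with a small simulated stand-in for the Big_Mart train data (column names as in the slides, values made up):

```r
library(ggplot2)

# Small simulated stand-in for the Big_Mart train data
set.seed(1)
train <- data.frame(
  Item_Visibility = runif(60, 0, 0.35),
  Item_MRP        = runif(60, 30, 270),
  Item_Type       = rep(c("Dairy", "Snacks", "Household"), each = 20)
)

# facet_wrap() draws one scatter panel per Item_Type
p <- ggplot(train, aes(Item_Visibility, Item_MRP)) +
  geom_point() +
  facet_wrap(~ Item_Type) +
  theme_bw() +
  labs(title = "Scatterplot per Item_Type")
print(p)
```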
2. Histogram: It is used to plot a continuous variable. It breaks the data into
bins and shows the frequency distribution of these bins. We can always change
the bin size and see its effect on the visualization.
3. Stacked Bar Chart: An advanced version of the bar chart, used for
visualizing a combination of categorical variables.
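Both plot types above can be sketched with ggplot2 on a simulated stand-in for the train data (column names from the slides, values made up):

```r
library(ggplot2)

# Simulated stand-in for the Big_Mart train data
set.seed(1)
train <- data.frame(
  Item_MRP          = runif(200, 30, 270),
  Item_Type         = sample(c("Dairy", "Snacks", "Household"), 200, replace = TRUE),
  Outlet_Identifier = sample(c("OUT010", "OUT013", "OUT017"), 200, replace = TRUE)
)

# Histogram of a continuous variable; binwidth controls the bin size
p1 <- ggplot(train, aes(Item_MRP)) + geom_histogram(binwidth = 30)

# Stacked bar chart: one bar per outlet, stacked by the second categorical variable
p2 <- ggplot(train, aes(Outlet_Identifier, fill = Item_Type)) + geom_bar()

print(p1)
print(p2)
```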
R Code:
> ggplot(train, aes(Outlet_Identifier, Item_Outlet_Sales)) + geom_boxplot(fill =
"red")+scale_y_continuous("Item Outlet Sales", breaks= seq(0,15000,
by=500))+labs(title = "Box Plot", x = "Outlet Identifier")
The black points are outliers. Outlier detection and removal is an essential step of
successful data exploration.
DATA VISUALIZATION
For the Big_Mart_Dataset, when we want to analyse the trend of item outlet
sales, an area chart can be plotted as shown below. It shows the count of
outlets on the basis of sales.
R Code:
> ggplot(train, aes(Item_Outlet_Sales)) + geom_area(stat = "bin", bins = 30, fill
= "steelblue") + scale_x_continuous(breaks = seq(0,11000,1000))+ labs(title =
"Area Chart", x = "Item Outlet Sales", y = "Count")
Area chart shows continuity of Item Outlet Sales using function ggplot() with
geom_area.
DATA VISUALIZATION
R Code:
The dark portion indicates Item MRP close to 50. The brighter portion indicates
Item MRP close to 250.
DATA VISUALIZATION
> install.packages("corrgram")
> library(corrgram)
> corrgram(train, order=NULL, panel=panel.shade, text.panel=panel.txt,
main="Correlogram")
DATA VISUALIZATION
• The darker the colour, the higher the correlation between variables. Positive
correlations are displayed in blue and negative correlations in red. Colour
intensity is proportional to the correlation value.
• We can see that Item cost & Outlet sales are positively correlated, while Item
weight & its visibility are negatively correlated.
4. DATA PRE-PROCESSING
Option 2: Replace the missing values with variable mean, median, etc
Replacing the missing values with the mean:
SL No cmusage l3musage avrecharge Proj Growth Circle
1 5.1 3.5 99.4 11 1
2 4.9 3 98.6 11 1
3 5.975 3.2 96.14117647 11 1
4 4.6 3.1 98.5 1 1
5 5 3.105882353 98.4 11 1
6 5.4 3.9 98.3 12 1
7 7 3.2 95.3 6 2
8 6.4 3.2 95.5 7 2
9 6.9 3.1 95.1 7 2
10 5.975 2.3 96 5 2
11 6.5 2.8 95.4 7 2
12 5.7 3.105882353 95.5 5 2
13 6.3 3.3 96.14117647 8 2
14 6.7 3.3 94.3 3 3
15 6.7 3 94.8 2 3
16 6.3 2.5 95 10 3
17 5.975 3 94.8 4 3
18 6.2 3.4 94.6 2 3
19 5.9 3 94.9 9 3
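The mean imputation shown in the table can be sketched in R; the values below are illustrative, not the table's data:

```r
# Mean imputation: replace the NA entries of a column with the mean of the
# observed values (illustrative numbers, not the slide's data set)
cmusage <- c(5.1, 4.9, NA, 4.6, 5.0, 5.4)
cmusage[is.na(cmusage)] <- mean(cmusage, na.rm = TRUE)
cmusage   # the NA is now 5.0, the mean of the observed values
```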
TRANSFORMATION / NORMALIZATION
z transform:
Transformed data = (Data – Mean) / SD
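The z transform above is what R's scale() computes by default; a minimal sketch:

```r
# z-transform by the formula and via scale(), which does the same thing
x <- c(53, 57, 59, 61, 65)
z <- (x - mean(x)) / sd(x)
all.equal(as.vector(scale(x)), z)   # TRUE: scale() matches the formula
mean(z)                             # ~0 after transformation
```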
RANDOM SAMPLING
Example: Take a sample of size 60 (10%) randomly from the data given in the
file bank-data.csv and save it as a new csv file.
> write.csv(mysample, "E:/ISI/mysample.csv")
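The sampling step before the write.csv call can be sketched as follows, with a hypothetical stand-in for bank-data.csv (600 rows assumed so that 60 is 10%):

```r
# Hypothetical stand-in for bank-data.csv (600 rows assumed)
mydata <- data.frame(id = 1:600, balance = rnorm(600, 1000, 200))

set.seed(1)                                      # reproducible sample
mysample <- mydata[sample(nrow(mydata), 60), ]   # 60 rows = 10%, no replacement
# write.csv(mysample, "E:/ISI/mysample.csv", row.names = FALSE)
nrow(mysample)   # 60
```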
RANDOM SAMPLING
Example: Split the data given in the file bank-data.csv randomly into two sets,
namely training (75%) and test (25%).
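A 75/25 split can be sketched with sample() on a hypothetical stand-in for bank-data.csv:

```r
# 75/25 train/test split sketch (mydata stands in for bank-data.csv)
mydata <- data.frame(id = 1:600, balance = rnorm(600, 1000, 200))

set.seed(1)
idx      <- sample(nrow(mydata), size = 0.75 * nrow(mydata))
training <- mydata[idx, ]    # 75% of the rows
test     <- mydata[-idx, ]   # the remaining 25%
c(nrow(training), nrow(test))   # 450 150
```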
5. TEST OF HYPOTHESIS
Introduction:
In many situations, it is required to accept or reject a statement or claim
about some parameter.
Example:
1. The average cycle time is less than 24 hours
2. The % rejection is only 1%
The statement is called the hypothesis.
The procedure for decision making about the hypothesis is called hypothesis
testing.
Advantages
1. Handles uncertainty in decision making
2. Minimizes subjectivity in decision making
3. Helps to validate assumptions or verify conclusions
TEST OF HYPOTHESIS
Null Hypothesis:
A statement about the status quo
One of no difference or no effect
Denoted by H0
Alternative Hypothesis:
One in which some difference or effect is expected
Denoted by H1
TEST OF HYPOTHESIS
Conclusion:
It is difficult to conclude that mean = specified value by looking at
(xbar - specified value) alone.
TEST OF HYPOTHESIS
To check whether the test statistic is close to 0, find the p value from the
sampling distribution of the test statistic.
TEST OF HYPOTHESIS
Methodology demo: To Test Mean = Specified Value
P value
The probability that such evidence or a more extreme result will occur when H0 is true
Based on the reference distribution of the test statistic
The tail area beyond the value of the test statistic in the reference distribution
[Figure: reference t distribution with the tail area beyond t0 shaded as the p value]
TEST OF HYPOTHESIS
Methodology demo : To Test Mean = Specified Value
If the test statistic t0 is close to 0, then p will be high.
If the test statistic t0 is not close to 0, then p will be small.
If p is small, i.e. p < 0.05 (with alpha = 0.05), conclude that t0 is not 0;
then Mean ≠ Specified Value, and H0 is rejected.
TEST OF HYPOTHESIS
H0: Mean = 5
H1: Mean ≠ 5
t0 = 0.5571
P = 0.59
[Figure: t0 = 0.56 lies near the centre of the reference t distribution, so the p value is large]
TEST OF HYPOTHESIS
Statistics Value
t 3.7031
df 99
P value 0.0001753
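An output like the table above comes from t.test(); a sketch on simulated data (the sample here is made up, not the slide's data, but the sample size of 100 gives the same df = 99):

```r
# One-sample t test sketch (simulated data, not the slide's data set)
set.seed(1)
x <- rnorm(100, mean = 5.2, sd = 0.5)   # sample of size 100

res <- t.test(x, mu = 5)   # H0: mean = 5 vs H1: mean != 5
res$statistic              # t0
res$parameter              # df = n - 1 = 99
res$p.value                # reject H0 when p < 0.05
```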
6. NORMALITY TEST
A methodology to check whether the characteristic under study is normally
distributed or not
Two methods: a graphical check (normal Q-Q plot) and the Shapiro-Wilk test
Statistics Value
W 0.9804
p value 0.1418
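The W statistic and p value above come from shapiro.test(); a sketch on simulated data (not the slide's sample):

```r
# Shapiro-Wilk normality test sketch on a simulated, genuinely normal sample
set.seed(1)
x <- rnorm(50)

res <- shapiro.test(x)
res$statistic   # W statistic (close to 1 for normal data)
res$p.value     # p > 0.05: no evidence against normality
```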
7. ANALYSIS OF VARIANCE
ANOVA
Example:
To study the effect of shelf location on sales revenue
Factor: Location(A)
Levels : front, middle, rear
Response: Sales revenue
Step 1: Calculate the sum, average and number of response values for each
level of the factor (location).
Level 1 Sum(A1):
Sum of all response values when location is at level 1 (front)
= 1.55 + 2.36 + 1.84 + 1.72
= 7.47
nA1: Number of response values when location is at level 1 (front)
=4
Level 1 Average:
Sum of all response values when location is at level 1 / number of response
values when location is at level 1
= A1 / nA1 = 7.47 / 4 = 1.87
Anova Table:
MS = SS / df
F = MSBetween/ MSWithin
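The table computation above is what aov() produces; a sketch in which the four front values come from the slide and the middle and rear values are made up for illustration:

```r
# One-way ANOVA sketch for the shelf-location example; only the front values
# are from the slide, the rest are hypothetical
Revenue  <- c(1.55, 2.36, 1.84, 1.72,      # front (from the slide)
              3.40, 4.10, 3.75, 3.90,      # middle (hypothetical)
              2.40, 2.80, 2.55)            # rear   (hypothetical)
location <- factor(rep(c("front", "middle", "rear"), c(4, 4, 3)))

fit <- aov(Revenue ~ location)
summary(fit)    # SS, df, MS, F = MSBetween / MSWithin, and the p value
```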
Meaning:
When the factor is changed from one level to another, there will be a
significant change in the response.
Meaning:
The sales revenue is not the same for the different locations: front, middle &
rear.
The expected sales revenue for the different locations under study is equal to
the level averages:
Front   1.8675
Middle  3.78875
Rear    2.591667
> library(gplots)
> plotmeans(Revenue ~ location)
R code
>TukeyHSD(fit)
Comparison   Mean difference   Lower   Upper   p value
Anova logic:
Variation between the levels of a factor:
The effect of the factor.
Variation within the levels of a factor:
The inherent variation in the process, or process error.
Since the p value = 0.1472 > 0.05, the variances within the levels are equal.
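A homogeneity-of-variance p value like this typically comes from a test such as Bartlett's (an assumption here: the slide does not name the test); a sketch on hypothetical shelf data:

```r
# Bartlett's test for equal variances across factor levels; the data are
# hypothetical, only the first four values appear on the slides
Revenue  <- c(1.55, 2.36, 1.84, 1.72,
              3.40, 4.10, 3.75, 3.90,
              2.40, 2.80, 2.55)
location <- factor(rep(c("front", "middle", "rear"), c(4, 4, 3)))

res <- bartlett.test(Revenue ~ location)   # H0: equal variances across levels
res$p.value   # > 0.05 means the equal-variance assumption is not rejected
```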
8. REGRESSION ANALYSIS
Regression
Correlation helps:
• To check whether two variables are related
• If related, to identify the type & degree of relationship
Regression helps:
• To identify the exact form of the relationship
• To model output in terms of input or process variables
Exercise 1: The data from the pulp drying process is given in the file
DC_Simple_Reg.csv. The file contains data on the dry content achieved
at different dryer temperatures. Develop a prediction model for dry content
in terms of dryer temperature.
Correlation of dry content with Temperature: 0.9992
Remark:
The correlation between y & x needs to be high (preferably 0.8 to 1.0 or
-0.8 to -1.0)
4: Performing Regression
> model = lm(DContent ~ Temp)
> summary(model)
Statistic                Value      Criteria
Residual standard error  0.07059
Multiple R-squared       0.9984     > 0.6
Adjusted R-squared       0.9983     > 0.6

Model       df   F       p value
Regression  1    24497   0.000
Residual    40
Total       41

Criteria: p value < 0.05
4: Performing Regression
Term         Estimate   Std. Error   t value   p value
Temperature  1.293432   0.008264     156.518   0.00
Interpretation
The p value for the independent variable needs to be < the significance level
(generally alpha = 0.05)
5: Regression Anova
> anova(model)
ANOVA
Source SS df MS F p value
Temp 122.057 1 122.057 24497 0.000
Residual 0.199 40 0.005
Total 122.256 41
5: Residual Analysis
> pred = fitted(model)
> Res = residuals(model)
> write.csv(pred,"D:/ISI/DataSets/Pred.csv")
> write.csv(Res,"D:/ISI/DataSets/Res.csv")
SL No. Fitted Residuals SL No. Fitted Residuals
1 73.32259 -0.02259 22 74.61602 -0.01602
2 74.61602 -0.01602 23 75.26274 -0.06274
3 73.96931 0.030693 24 73.96931 0.030693
4 78.49632 0.00368 25 75.90946 -0.00946
5 74.61602 -0.01602 26 75.26274 0.03726
6 73.96931 0.030693 27 73.96931 0.030693
7 75.26274 -0.06274 28 78.49632 0.00368
8 77.20289 -0.00289 29 76.55617 -0.05617
9 75.90946 -0.00946 30 74.61602 -0.11602
10 74.61602 -0.01602 31 75.90946 0.090544
11 73.32259 -0.02259 32 76.55617 -0.05617
12 75.90946 -0.00946 33 76.55617 0.143828
13 75.90946 0.090544 34 75.90946 0.090544
14 74.61602 -0.01602 35 75.90946 -0.10946
15 74.61602 0.083977 36 73.96931 -0.16931
16 74.61602 -0.11602 37 73.32259 -0.02259
17 70.73573 -0.03573 38 74.61602 -0.01602
18 72.02916 -0.02916 39 73.32259 0.077409
19 72.02916 0.070841 40 75.90946 0.090544
20 72.02916 0.170841 41 73.96931 0.030693
21 70.73573 -0.03573 42 75.26274 -0.06274
REGRESSION ANALYSIS
5: Residual Analysis
Scatter Plot: Actual Vs Predicted (fit)
> plot(DContent, pred)
5: Residual Analysis
Normality Check on residuals
> qqnorm(Res)
> qqline(Res)
5: Residual Analysis
Normality Check on residuals
> shapiro.test(Res)
5: Residual Analysis
> plot(pred, Res)
> plot(Temp, Res)
Similarly, the residuals shall not depend on x. This can be checked by plotting
residuals vs x. A pattern in this plot is an indication that the residuals are
not independent of x. In that case, develop the model with a function of x as
predictor (e.g. x², 1/x, √x, log(x), etc.).
Residual Analysis
6: Outlier test
Observations with Bonferroni p value < 0.05 are potential outliers
> library(car)
> outlierTest(model)
Statistic Value
Delta 0.005201004
Exercise: Temperature and reaction time affect the % yield. The data
collected is given in the Mult-Reg_Yield file. Develop a model for % yield
in terms of temperature and time.
Regression ANOVA
Model SS df MS F p value
Regression 6797.063 2 3398.531 27.07 0.0000
Residual 1632.08138 13 125.5447
Total 8429.14438 15
ANOVA
Source SS df MS F p value
Time 6777.8 1 6777.8 53.9872 0.000
Temp 19.3 1 19.3 0.1534 0.702
Residual 1632.1 13 125.5
6: Outlier test
Observations with Bonferroni p value < 0.05 are potential outliers
> library(car)
> outlierTest(mymodel)
Statistic Value
Delta 128.8541
Exercise: Temperature, time and the kappa number of the pulp affect the %
conversion of UB pulp to Cl2 pulp. The data collected is given
in the Mult_Reg_Conversion file. Develop a model for % conversion in
terms of the explanatory variables.
Interpretation
High Correlation between % Conversion and Temperature & Time
High Correlation between Temperature & Time - Multicollinearity
Regression Output
Regression ANOVA
Model SS df MS F p value
Regression 1953.419 3 651.140 45.885 0.0000
Residual 170.290 12 14.191
Total 2123.709 15
Regression Output
Tackling Multicollinearity:
1. Remove one or more of highly correlated independent variable
2. Principal Component Regression
3. Partial Least Square Regression
4. Ridge Regression
Tackling Multicollinearity:
Approach
• A null model is developed without any predictor variable x. In the null model,
the predicted value will be the overall mean of y
• Then predictor variables x's are added to the model sequentially
• After adding each new variable, the method also removes any variable that no
longer provides an improvement in the model fit
• Finally, the best model is identified as the one which minimizes the Akaike
information criterion (AIC)
AIC = (1 / (n σ̂²)) (RSS + 2 d σ̂²)
where
n: number of observations
σ̂²: estimate of the error or residual variance
d: number of x variables included in the model
RSS: residual sum of squares
Tackling Multicollinearity:
R code
> library(MASS)
> mymodel = lm(X..Conversion ~ Temperature + Time + Kappa.number)
> step =stepAIC(mymodel, direction = "both")
REGRESSION ANALYSIS
REGRESSION ANALYSIS
Statistic Value
Mean Square Error (MSE) 10.7
Root Mean Square Error (RMSE) 3.27
REGRESSION ANALYSIS
Steps
1. Divide the data set into k equal subsets
2. Keep one subset (sample) for model validation
3. Develop the model using the other k − 1 subsets put together
4. Predict the responses for the test data and compute residuals
5. Return the test sample back to the original data set and take another
subset for model validation
6. Go to step 3 and continue until all the subsets are tested with different
models
7. Compute the overall Root Mean Square Residuals. RMSE of validation
should not be high compared to the original model developed with all the
data points together.
Note: when k = n, k-fold cross validation is the same as leave-one-out cross
validation.
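The steps above can be sketched manually in base R (the data here are simulated, not the pulp data; DAAG's cv.lm, shown next, automates the same loop):

```r
# Manual k-fold cross-validation sketch for a linear model (base R only)
set.seed(1)
mydata <- data.frame(x = runif(48, 60, 90))
mydata$y <- 1.3 * mydata$x + rnorm(48, sd = 0.1)   # hypothetical data

k     <- 4
folds <- sample(rep(1:k, length.out = nrow(mydata)))   # step 1: k subsets
rmse  <- numeric(k)
for (i in 1:k) {
  fit     <- lm(y ~ x, data = mydata[folds != i, ])    # step 3: fit on k-1 folds
  held    <- mydata[folds == i, ]                      # step 2: held-out fold
  res     <- held$y - predict(fit, held)               # step 4: test residuals
  rmse[i] <- sqrt(mean(res^2))
}
sqrt(mean(rmse^2))   # step 7: overall validation RMSE
```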
REGRESSION ANALYSIS
R code
> library(DAAG)
> cv.lm(mymodel, df = mydata, m = 16)
REGRESSION ANALYSIS
Ridge regression minimizes

Σᵢ₌₁ⁿ (yᵢ − β₀ − Σⱼ₌₁ᵖ βⱼ xᵢⱼ)² + λ Σⱼ₌₁ᵖ βⱼ² = RSS + λ Σⱼ₌₁ᵖ βⱼ²

where λ ≥ 0 is a tuning parameter and λ Σⱼ₌₁ᵖ βⱼ² is the shrinkage penalty.
REGRESSION ANALYSIS
Ridge regression seeks coefficient estimates that fit the data well by
minimizing the RSS, and the tuning parameter λ has the effect of shrinking the
estimates βⱼ towards zero.
The value of λ is identified through 10-fold cross validation.
REGRESSION ANALYSIS
R Code
> library(glmnet)
> set.seed(1)
> y = mydata[,5]
> x =mydata[,2:4]
> x = as.matrix(x)
Cross Validation
> mymodel = cv.glmnet(x , y, alpha =0)
> plot(mymodel)
REGRESSION ANALYSIS
REGRESSION ANALYSIS
Variable Coefficients
(Intercept) -63.0713
Temperature 0.0823
Time -117.5048
Kappa.number 0.3268
Example: A study was conducted to measure the effect of gender and income
on attitude towards vacation. Data was collected from 30 respondents and is
given in the Travel_dummy_reg file. Attitude towards vacation is measured on a
9-point scale. Gender is coded as male = 1 and female = 2. Income is coded as
low = 1, medium = 2 and high = 3. Develop a model for attitude towards
vacation in terms of gender and income.
Gender   Code   Dummy
Male     1      0
Female   2      1
Multiple R2 0.8603
Adjusted R2 0.8442
F Statistics 53.37
P value 0.00
9. BINARY LOGISTIC REGRESSION
p = e^(a + b1x1 + b2x2 + ⋯ + bkxk) / (1 + e^(a + b1x1 + b2x2 + ⋯ + bkxk))
p: probability of success
xi's: independent variables
a, b1, b2, …: coefficients to be estimated
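The coefficients of this model are estimated with glm(); a sketch on simulated stand-ins for the resort-visit variables (names from the slides, values made up):

```r
# Binary logistic regression sketch with glm(); simulated data, not the
# Resort_Visit data set from the slides
set.seed(1)
n <- 100
Family_Income <- rnorm(n, 50, 10)
p_true        <- 1 / (1 + exp(-(-10 + 0.2 * Family_Income)))
Resort_Visit  <- rbinom(n, 1, p_true)

mymodel <- glm(Resort_Visit ~ Family_Income, family = binomial)
summary(mymodel)        # estimates of a and b1 with p values
head(fitted(mymodel))   # fitted p = e^(a + b1 x1) / (1 + e^(a + b1 x1))
```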
3. Correlation Matrix
> cor(mydata)
Resort_Visit Family_Income Attitude_Travel Importance_Vacation House_Size Age_Head
Resort_Visit 1.00 -0.60 -0.27 -0.42 -0.59 -0.21
Family_Income -0.60 1.00 0.30 0.23 0.47 0.21
Attitude_Travel -0.27 0.30 1.00 0.19 0.15 -0.13
Importance_Vacation -0.42 0.23 0.19 1.00 0.30 0.11
House_Size -0.59 0.47 0.15 0.30 1.00 0.09
Age_Head -0.21 0.21 -0.13 0.11 0.09 1.00
[Table: group means of Family_Income, Attitude_Travel, Importance_Vacation, House_Size and Age_Head by Resort_Visit; values not reproduced]
The higher the difference in means, the stronger the relation to the response variable.
Income vs Visit
Null model: df 29, deviance 41.589
Since the p value < 0.05 for Income, Importance_Vacation & Size, redo the
modelling with the important factors only.
Since the p value < 0.05 for both factors, Income & Size, the response variable
can be modelled in terms of those two factors.
The model is
Actual \ Predicted   0    1    Total
0                    40   10   50
1                    3    47   50
Total                43   50   100
Statistics   Value
Accuracy %   87
Error %      13
An accuracy of 80% or more is considered good.
10. ORDINAL LOGISTIC REGRESSION
Example 1: The data on system test defect density along with testing effort and
test coverage is given in ST_Defects.csv. The defect density is
classified as Low / Medium / High. Develop a model to estimate the
system testing defect density class based on testing effort and test
coverage.
Make one of the classes (say "Low") of the output variable the baseline level
> library(MASS)
> mymodel = polr(dd ~ effort + coverage)
> summary(mymodel)
Coefficients
effort coverage
0.0234 0.0257
Intercepts
1.4947 3.925
Predicted values:
> pred = predict(mymodel)
> fit = fitted(mymodel)
> fit
> output = cbind(dd, pred)
> write.csv(output, "E:/Infosys/Part 2/output.csv")
Comparing Actual Vs Predicted
> mytable = table(dd, pred)
> mytable
> prop.table(mytable)
Actual \ Predicted   High   Low   Medium
High                 8      42    0
Low                  0      105   0
Medium               1      44    0
Comparing Actual Vs Predicted (in %)
Actual \ Predicted   High   Low     Medium
High                 4.00   21.00   0.00
Low                  0.00   52.50   0.00
Medium               0.50   22.00   0.00
For other queries mail me at [email protected]
THANK YOU