Q. Examine ANOVA in R. State the assumptions and explain One-Way ANOVA & Two-Way ANOVA in
detail. Also state the benefits & limitations of ANOVA.
1) ANOVA, or Analysis of Variance, is a parametric statistical technique that helps in finding out if there is a significant difference between the means of three or more groups. It checks the impact of various factors by comparing groups (samples) based on their respective means. ANOVA tests the null hypothesis that all group means are equal against the alternative hypothesis that at least one group mean is different. 2) Assumptions - a) The dependent variable is approximately normally distributed within each group. This assumption is more critical for smaller sample sizes. b) The samples are selected at random and should be independent of one another. c) All groups have equal standard deviations (homogeneity of variance). d) Each data point should belong to one and only one group. There should be no overlap or sharing of data points between groups. (A sketch for checking assumptions a and c in R appears after the limitations below.) 3) Benefits of ANOVA: a) Allows comparison of
more than two groups simultaneously. b) Identifies whether significant differences exist without requiring multiple
t-tests. c) Reduces the risk of Type I error (false positives). 4) Limitations of ANOVA: a) Sensitive to violations of
assumptions (e.g., normality, homogeneity). b) Does not indicate which groups differ (requires post-hoc tests).
c) Limited to numerical dependent variables and categorical independent variables.
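A minimal sketch of checking the normality and equal-variance assumptions in R; the data and variable names are hypothetical, not from the original:
# Hypothetical plant heights under three groups
height <- c(20, 22, 19, 25, 27, 24, 30, 32, 29)
group <- factor(rep(c("A", "B", "C"), each = 3))
# Assumption (a): normality within each group (Shapiro-Wilk test per group)
tapply(height, group, shapiro.test)
# Assumption (c): equal variances across groups (Bartlett's test)
bartlett.test(height ~ group)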
5) One-Way ANOVA - This is the most basic form of ANOVA and is used when there is only one independent variable with more than two levels or groups. It assesses whether there are any statistically significant differences among the means of the groups. Example with procedure for One-Way ANOVA: Imagine a researcher
wants to test the effectiveness of three different fertilizers (A, B, and C) on plant growth. They apply each fertilizer
to a separate group of plants and measure the height of the plants after a certain period. a) Hypotheses: Null
Hypothesis (H₀): There is no significant difference in the mean plant height between the three fertilizer groups.
Alternative Hypothesis (H₁): At least one group's mean plant height is significantly different from the others. b)
Data Collection: Collect data on plant height for each fertilizer group. c) ANOVA Test: Perform the ANOVA test to
calculate the F-statistic and p-value. d) Decision Making: If the p-value is less than the significance level (e.g., 0.05),
reject the null hypothesis. This indicates that at least one group's mean is significantly different from the others. If
the p-value is greater than the significance level, fail to reject the null hypothesis. This suggests that there is no
significant difference between the group means. e) Post-hoc Tests: If the ANOVA test shows a significant difference,
post-hoc tests like Tukey's HSD or Bonferroni's test can be used to determine which specific groups differ
significantly from each other. By using one-way ANOVA, the researcher can determine if there is a significant
difference in plant growth due to the different fertilizers and identify which fertilizer is most effective.
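A minimal sketch of this one-way ANOVA in R; the plant-height values are purely illustrative:
# Hypothetical plant heights for fertilizers A, B, and C
height <- c(20, 22, 19, 25, 27, 24, 30, 32, 29)
fertilizer <- factor(rep(c("A", "B", "C"), each = 3))
model <- aov(height ~ fertilizer)   # fit the one-way ANOVA
summary(model)                      # F-statistic and p-value
TukeyHSD(model)                     # post-hoc test if the result is significant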
6) Two-Way ANOVA - A statistical test used to determine the effect of two independent categorical variables (factors) on a continuous dependent variable, and to explore whether there is an interaction between these factors. Example with procedure for Two-Way ANOVA: A researcher tests the effects of three fertilizers (A, B, C) and two watering frequencies (daily, every other day) on plant height. a) Hypotheses - Null Hypotheses: Fertilizer type: Mean plant heights are the same for all fertilizers. Watering frequency: Mean plant heights are the same for both watering schedules. Interaction: No interaction exists between fertilizer type and watering frequency. Alternative Hypotheses: Fertilizer type: At least one fertilizer produces a different mean height. Watering frequency: At least one watering schedule produces a different mean height. Interaction: The effect of fertilizer type on height depends on watering frequency. b) Data Collection: Measure plant heights for all
fertilizer–watering combinations. c) ANOVA Test: Calculate F-statistics and p-values for: Fertilizer (Factor 1),
Watering frequency (Factor 2), and Interaction (Fertilizer × Watering). d) Comparison: Reject null hypotheses if p-values
< 0.05. e) Decision: Significant results indicate differences in group means or interaction effects.
f) Post-Hoc Tests: Perform Tukey’s HSD to identify specific group differences if significant.
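A minimal sketch of this two-way ANOVA in R, again with purely illustrative data:
# Hypothetical design: 3 fertilizers x 2 watering frequencies, 2 plants per cell
height <- c(20, 22, 25, 27, 30, 32, 18, 21, 24, 26, 28, 31)
fertilizer <- factor(rep(c("A", "B", "C"), each = 2, times = 2))
watering <- factor(rep(c("daily", "alternate"), each = 6))
model <- aov(height ~ fertilizer * watering)   # * fits both main effects and the interaction
summary(model)                                 # F-statistics and p-values for each term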
7) Applications of ANOVA - a) Business & Marketing - Comparing the effectiveness of three marketing campaigns. b) Education - Comparing student performance under three different teaching methods. c) Healthcare - Comparing patient outcomes under three different treatments.
Q. Describe descriptive analytics in R. Explain any three functions of descriptive analytics in R.
1) Descriptive analytics is the process of summarizing and analyzing historical data to uncover patterns, trends, and
insights. In R, descriptive analytics is widely used to compute statistical summaries, visualize data, and understand
its distribution. 2) R provides powerful tools and functions to perform descriptive analytics efficiently, helping
analysts make data-driven decisions. 3) Key Functions for Descriptive Analytics in R a) summary(): Purpose:
Provides a quick summary of an object (e.g., data frame, vector). Output: For numeric data: Minimum, 1st Quartile,
Median, Mean, 3rd Quartile, and Maximum. For categorical data: Count of each level. Example:
data <- c(10, 20, 30, 40, 50)
summary(data)
Output:
Min. 1st Qu. Median Mean 3rd Qu. Max.
10 20 30 30 40 50
b) mean() and sd(): Purpose: Compute the mean (average) and standard deviation of numeric data. Use: mean()
calculates the central tendency of data. sd() measures the spread or dispersion of data. Example:
data <- c(10, 20, 30, 40, 50)
mean(data) # Output: 30
sd(data) # Output: 15.81139
c) table(): Purpose: Creates frequency tables for categorical data. Use: Understand the distribution of categories in a dataset. Example:
category <- c("A", "B", "A", "C", "A", "B", "C")
table(category)
Output:
category
A B C
3 2 2
4) Benefits of Descriptive Analytics in R - a) Comprehensive Data Summarization - R allows quick computation of key
statistical measures like mean, median, variance, and standard deviation, providing a clear understanding of data
distribution and trends. b) Easy Data Visualization - With built-in packages like ggplot2, lattice, and base R plots, R
offers a wide range of visualizations (e.g., histograms, bar charts, scatter plots), helping users better understand
data patterns and relationships. c) Handling Large Datasets - R efficiently handles large datasets using packages like data.table and dplyr, enabling fast and memory-efficient data manipulation and summarization. d) Customizability - R's flexibility allows users to customize statistical summaries, plots, and tables to meet specific analytical needs, making it suitable for diverse applications. e) Rich Library of Functions - R has numerous built-in functions and packages (e.g., summary(), psych) that streamline descriptive analytics, making complex tasks easy.
Q.22) Explain the Z-test of hypothesis testing. Write the syntax and explain in detail.
A Z-Test is a statistical method used to determine whether there is a significant difference between sample data
and a population parameter (mean or proportion) or between two samples. It is applicable when the sample size is
large (n>30) or when the population variance is known. 1) When to Use a Z-Test: a) For Population Mean: Used
to compare a sample mean with a known population mean (μ). b) For Population Proportion: Used to compare a
sample proportion (p) with a population proportion (P0). c) Between Two Samples: Used to test the difference
between two sample means or proportions. d) Conditions for Z-Test: Data is normally distributed, or the sample size is large (n > 30), and the population variance (σ²) is known. 2) Hypotheses in Z-Test: a) Null Hypothesis: Assumes no effect
or difference (e.g., μ = μ₀). b) Alternative Hypothesis: Assumes an effect or difference (e.g., μ ≠ μ₀). 3) Formulas for the Z-Test:
a) For a single sample mean: Z = (x̄ − μ) / (σ/√n), where x̄: sample mean, μ: population mean, σ: population standard deviation, n: sample size.
b) For a single sample proportion: Z = (p − p₀) / √(p₀(1 − p₀)/n), where p: sample proportion, p₀: population proportion, n: sample size.
Steps in Performing a Z-Test: 1) Define Hypotheses: H₀: Null hypothesis, Hₐ: Alternative hypothesis. 2)
Determine the Level of Significance (α): Common values: 0.05, 0.01 3) Compute the Z-Statistic: Use the
appropriate formula. 4) Determine the Critical Value: Use a Z-table to find the critical value corresponding to α.
5) Make a Decision: If |Z| > Z_critical, reject H₀; if |Z| ≤ Z_critical, fail to reject H₀. 6) Draw Conclusion - Based on the decision, interpret the results in the context of the problem. Syntax -
# Example: Single-sample Z-test
z_test <- function(sample_mean, population_mean, population_sd, n) {
  # Z-statistic: how many standard errors the sample mean is from the population mean
  z_stat <- (sample_mean - population_mean) / (population_sd / sqrt(n))
  # Two-tailed p-value from the standard normal distribution
  p_value <- 2 * (1 - pnorm(abs(z_stat)))
  return(list(Z_Statistic = z_stat, P_Value = p_value))
}
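A hypothetical call to this function (the numbers are illustrative only):
# Sample of n = 36 with mean 105, tested against a population mean of 100 (sd = 15)
result <- z_test(sample_mean = 105, population_mean = 100, population_sd = 15, n = 36)
print(result)   # Z = 2, two-tailed p ≈ 0.0455, so reject H0 at alpha = 0.05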
Q.17) Explain the t-test of hypothesis testing. Write the syntax in detail.
The t-test is a statistical hypothesis test that determines whether the means of one or two groups are significantly
different from a known value or each other. It uses the t-distribution, which is a family of probability distributions
suitable for small sample sizes. Why Use a T-Test? - To determine if a treatment or intervention has an effect, To
compare two datasets (e.g., control vs. experimental groups), To assess relationships in paired or dependent data
(e.g., pre- and post-measurements). Key Concepts in T-Test 1) Hypotheses a) Null Hypothesis (H₀): Assumes no
significant difference (e.g., group means are equal). b) Alternative Hypothesis (H₁): Assumes a significant difference
(e.g., group means are not equal, one mean is greater/lesser than the other). 2) Types of T-Tests a) One-
Sample T-Test: Compares the mean of a single sample to a known value (e.g., population mean). Example: Testing if
the average IQ of a class differs from the population average of 100. b) Two-Sample (Independent) T-Test:
Compares the means of two independent groups. Example: Testing if men and women have different average
heights. c) Paired T-Test: Compares means of related groups or matched pairs (e.g., before-and-after observations).
Example: Testing the effect of a drug by comparing blood pressure before and after treatment. 3) Assumptions
a) Data is continuous (interval or ratio scale). b) Data is approximately normally distributed (especially important for
small samples). c) For the two-sample t-test: The two groups should have similar variances (use Welch’s t-test if this
is not true). Observations between the groups must be independent. d) For paired t-tests, the differences between
paired observations should be normally distributed. 4) Degrees of Freedom (df) Represents the number of values
that can vary independently in a calculation. a) For one-sample t-tests: df=n−1 (where n is the sample size). b) For
two-sample t-tests: df=n1+n2−2 (where n1 and n2 are sample sizes of the two groups). c) For paired t-tests: df=n−1
(based on paired differences). 5) T-Statistic - Measures how many standard deviations the sample mean is from the
population mean (or the difference between sample means relative to variability). Higher absolute values of the t-
statistic indicate greater differences. Formula: t = (observed difference in means − hypothesized difference) / standard error of the difference. 6) Syntax -
a) One-Sample Test, b) Two-Sample Test, c) Paired T-Test (all three illustrated in the sketch below).
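A minimal sketch of all three tests using base R's t.test(); the data vectors are hypothetical:
# a) One-sample t-test: compare a sample mean to a known value (here mu = 100)
iq <- c(98, 105, 110, 95, 102, 107)
t.test(iq, mu = 100)
# b) Two-sample (independent) t-test; Welch's test by default
men <- c(175, 180, 172, 178)
women <- c(165, 170, 168, 172)
t.test(men, women)
# c) Paired t-test: related before/after measurements
before <- c(140, 150, 145, 155)
after <- c(135, 148, 142, 150)
t.test(before, after, paired = TRUE)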
Q.19) What is confusion matrix? How confusion matrix can be used to evaluate the accuracy of the model?
1) A confusion matrix is a table that summarizes the performance of a classification model by comparing the
predicted labels with the actual labels. It is particularly useful for evaluating the performance of classification
models because it provides detailed insights into how well the model distinguishes between classes. 2) Structure of
Confusion Matrix - For a binary classification problem, the confusion matrix is a 2x2 table with the following components: True Positive (TP): Correctly predicted positive instances (model predicts "Yes" when the actual is "Yes"). True Negative (TN): Correctly predicted negative instances (model predicts "No" when the actual is "No"). False Positive (FP): Incorrectly predicted positive instances (model predicts "Yes" when the actual is "No"); also known as a Type I error. False Negative (FN): Incorrectly predicted negative instances (model predicts "No" when the actual is "Yes"); also known as a Type II error. 3) How the Confusion Matrix Evaluates Model Accuracy - The confusion matrix allows us to compute various
performance metrics to evaluate the accuracy and reliability of the model. a) Accuracy - Definition: The proportion of correct predictions out of the total predictions. Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN). Use: Indicates the overall correctness of the model. b) Precision - Definition: The proportion of correctly predicted positive observations to the total predicted positives. Formula: Precision = TP / (TP + FP). Use: Evaluates the relevance of the model's positive predictions. c) Recall (True Positive Rate) - Definition: The proportion of actual positive observations correctly identified. Formula: Recall = TP / (TP + FN). Use: Indicates how well the model identifies true positives. d) Specificity - Definition: The proportion of actual negatives correctly identified. Formula: Specificity = TN / (TN + FP). Use: Indicates how well the model identifies true negatives. e) F1-Score - Definition: Harmonic mean of Precision and Recall, balancing both metrics. Formula: F1 Score = 2 × (Precision × Recall) / (Precision + Recall). Use: Helpful when the dataset is imbalanced. Importance in Model Evaluation - Identify how the model performs on each class, understand the trade-off between precision and recall, and evaluate performance on imbalanced data using Precision and the F1-Score.
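A minimal sketch of building a confusion matrix and these metrics in base R; the label vectors are hypothetical:
# Hypothetical actual vs. predicted labels for a binary classifier
actual <- factor(c("Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No"))
predicted <- factor(c("Yes", "No", "No", "Yes", "No", "Yes", "Yes", "No"))
cm <- table(Predicted = predicted, Actual = actual)
print(cm)
TP <- cm["Yes", "Yes"]; TN <- cm["No", "No"]
FP <- cm["Yes", "No"]; FN <- cm["No", "Yes"]
accuracy <- (TP + TN) / (TP + TN + FP + FN)   # 0.75 for this toy data
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)
f1 <- 2 * precision * recall / (precision + recall)
cat("Accuracy:", accuracy, "Precision:", precision, "Recall:", recall, "F1:", f1, "\n")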
Q.20) Discuss the concept of odds and probabilities in Logistic regression. How ROC curve can be used to
evaluate the accuracy of Logistic model ?
Concept of Odds and Probabilities in Logistic Regression - In Logistic Regression, the output is a probability that a
given input belongs to a certain class. This probability is transformed into odds and modeled using the logit
function. Logistic regression is widely used for binary classification tasks. 1) Probability: Probability is a measure of
how likely an event is to occur. For binary classification, P(y=1) is the probability of the event occurring, and 1 − P(y=1)
is the probability of it not occurring. 2) Odds: Odds represent the ratio of the probability of an event occurring to the probability of it not occurring: Odds = P(y=1) / (1 − P(y=1)). 3) Logit Function: Logistic regression uses the log-odds (logarithm of odds) as the dependent variable, making it linear: logit(P) = ln(P(y=1) / (1 − P(y=1))). The logistic regression model is expressed as: logit(P) = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₖxₖ. ROC Curve for Logistic Regression Model
Evaluation - The Receiver Operating Characteristic (ROC) curve is a graphical representation used to evaluate the
performance of a classification model. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at
different classification thresholds. 1) Key Terms a) True Positive Rate (TPR) / Sensitivity: Measures the proportion
of actual positives correctly identified. Formula: TPR = TP / (TP + FN). b) False Positive Rate (FPR): Measures the proportion of actual negatives incorrectly classified as positive. Formula: FPR = FP / (FP + TN). 2) Interpreting the ROC
Curve: The x-axis represents the FPR, The y-axis represents the TPR, The curve shows how the TPR and FPR vary as
the decision threshold changes. 3) Area Under the Curve (AUC): AUC quantifies the overall ability of the model to
distinguish between classes. AUC = 1: Perfect model, AUC = 0.5: No discrimination (random guessing), Higher AUC
values indicate better model performance. 4) Advantages of the ROC Curve : Evaluates the model's performance
across all possible thresholds. Works well for imbalanced datasets by focusing on the trade-off between sensitivity
and specificity.
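A minimal sketch of fitting a logistic model and evaluating it with an ROC curve; it assumes the pROC package is installed and uses R's built-in mtcars data purely for illustration:
library(pROC)
# Illustrative model: predict transmission type (am) from weight and horsepower
model <- glm(am ~ wt + hp, data = mtcars, family = binomial)
probs <- predict(model, type = "response")   # predicted probabilities
roc_obj <- roc(mtcars$am, probs)             # TPR vs. FPR across all thresholds
plot(roc_obj, main = "ROC Curve for Logistic Model")
auc(roc_obj)                                 # AUC closer to 1 means better discrimination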
Q.23) What are outliers? How outliers can be detected using statistical methods?
An outlier is a data point that significantly deviates from the rest of the dataset. It may be unusually high or low
compared to the majority of values. Outliers can result from variability in data, measurement errors, or
experimental errors. 1) Why Detect Outliers? a) Impact on Analysis: Outliers can distort statistical summaries
(e.g., mean, standard deviation) and affect the results of machine learning models. b) Understanding Data:
Identifying outliers helps in understanding the dataset's variability or discovering anomalies. c) Data Quality: They
may indicate data entry errors or measurement issues that need correction. 2) Methods for Outlier Detection -
a) Z-Score Method - i) Measures how many standard deviations a data point is from the mean. ii) Data points with |Z| > 3 are often considered outliers. iii) Formula: Z = (x − μ) / σ. iv) Syntax -
scores <- c(50, 52, 53, 55, 56, 95)
mean_score <- mean(scores)
std_dev <- sd(scores)
z_scores <- (scores - mean_score) / std_dev
outliers_z <- scores[abs(z_scores) > 3]
cat("Z-Scores:", z_scores, "\n")
cat("Outliers (Z-Score Method):", outliers_z, "\n")
Note: in a sample this small, |Z| can never exceed (n − 1)/√n ≈ 2.04 for n = 6, so this method flags nothing here; it is more useful for larger samples.
b) Interquartile Range (IQR) Method: i) Identifies outliers based on the dataset's quartiles. ii) Formula: IQR = Q3 − Q1. iii) Lower Bound: Q1 − 1.5 × IQR; Upper Bound: Q3 + 1.5 × IQR. Data points outside these bounds are outliers. iv) Syntax -
scores <- c(50, 52, 53, 55, 56, 95)
Q1 <- quantile(scores, 0.25)
Q3 <- quantile(scores, 0.75)
IQR_value <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value
outliers_iqr <- scores[scores < lower_bound | scores > upper_bound]
cat("Lower Bound:", lower_bound, "\n")
cat("Upper Bound:", upper_bound, "\n")
cat("Outliers (IQR Method):", outliers_iqr, "\n")
c) Boxplot - i) A visual method for detecting outliers using the IQR. ii) Points outside the "whiskers" of the boxplot
are outliers. iii) Syntax
scores <- c(50, 52, 53, 55, 56, 95)
boxplot(scores, main = "Boxplot for Outlier Detection", ylab = "Scores")
outliers_boxplot <- boxplot(scores, plot = FALSE)$out
cat("Outliers (Boxplot Method):", outliers_boxplot, "\n")
d) Modified Z-Score - i) A robust alternative using the median. ii) Formula: Z = 0.6745 × (x − median) / MAD, where MAD is the Median Absolute Deviation. iii) Threshold: |Z| > 3.5. iv) Syntax -
scores <- c(50, 52, 53, 55, 56, 95)
median_score <- median(scores)
mad_value <- mad(scores, constant = 1)  # raw MAD; R's default already rescales by 1.4826
modified_z_scores <- 0.6745 * (scores - median_score) / mad_value
outliers_modified_z <- scores[abs(modified_z_scores) > 3.5]
cat("Modified Z-Scores:", modified_z_scores, "\n")
cat("Outliers (Modified Z-Score Method):", outliers_modified_z, "\n")
e) Statistical Tests: i) Grubbs' Test: Identifies one outlier in a univariate dataset. ii) Dixon's Q-Test: Used for small datasets to detect a single outlier. iii) Syntax -
library(outliers)   # provides grubbs.test(); package assumed installed
data <- c(10, 12, 14, 13, 12, 500, 15)
grubbs.test(data)   # tests whether the most extreme value (here 500) is an outlier