
Q) Consider the employee salary database and perform all types of descriptive analysis of the data with

the help of R programming code.


# Load the data (assume your file is 'employee_salary.csv')
employee_data <- read.csv("employee_salary.csv")
# View the first few rows and the structure
head(employee_data)
str(employee_data)
# Summary of the dataset
summary(employee_data)
# Check for missing values
colSums(is.na(employee_data))
# Calculate mean, median, and mode of salary
mean(employee_data$Salary, na.rm = TRUE)
median(employee_data$Salary, na.rm = TRUE)
mode_salary <- sort(table(employee_data$Salary), decreasing = TRUE)[1]
mode_salary
# Calculate standard deviation and variance
sd(employee_data$Salary, na.rm = TRUE)
var(employee_data$Salary, na.rm = TRUE)
# Frequency count for department
table(employee_data$Department)
# Create a histogram of salary
hist(employee_data$Salary, main = "Salary Distribution", xlab = "Salary", col = "blue", border = "black")
# Create a boxplot for salary
boxplot(employee_data$Salary, main = "Salary Boxplot", ylab = "Salary", col = "red")
# Correlation between Salary and Experience
cor(employee_data$Salary, employee_data$Experience, use = "complete.obs")
# Average salary by department
aggregate(Salary ~ Department, data = employee_data, FUN = mean, na.rm = TRUE)
Q. Explain multiple regression with its two applications
1) Multiple regression is a statistical method used to understand the relationship between multiple independent
variables and a dependent variable. 2) It extends simple linear regression by considering the combined effect of
several predictors on the outcome. 3) In R, the function used to perform Multiple Linear Regression (MLR) is the
lm() function, which stands for "linear model". Syntax: lm(formula, data). 4) Equation: The general form of a
multiple regression equation is Y = β0 + β1X1 + β2X2 + ⋯ + βnXn + ϵ, where Y: Dependent variable, X1, X2, …, Xn:
Independent variables, β0: Intercept, β1, β2,…, βn : Coefficients, ϵ : Error term. 5) Applications of Multiple
Regression- a) Business: Sales Prediction: Use Case: Predicting product sales based on factors like
advertising budget, pricing, and economic conditions. Example: Sales= β0 + β1(Advertising) + β2(Price) +
β3(Economic Index) + ϵ, Helps companies allocate resources effectively and set prices. b) Healthcare: Patient
Outcome Analysis: Use Case: Predicting a patient’s recovery time based on age, severity of illness, and treatment
type. Example: Recovery Time = β0 + β1(Age) + β2(Severity) + β3(Treatment Type) + ϵ, Helps doctors tailor
treatment plans and improve patient outcomes. 6) Assumptions: Linearity: Relationship between predictors and
the response is linear, Independence: Observations are independent of each other, Homoscedasticity: Constant
variance of residuals, Normality: Residuals are normally distributed.
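A minimal sketch of multiple regression in R with lm(), using the built-in mtcars dataset; the formula mpg ~ wt + hp is chosen only for illustration:
# Fit a multiple linear regression: mpg predicted by weight and horsepower
model <- lm(mpg ~ wt + hp, data = mtcars)
# Coefficients, standard errors, t-values, p-values, and R-squared
summary(model)
# Predict mpg for a hypothetical car (wt = 3.0, hp = 110 are illustrative values)
predict(model, newdata = data.frame(wt = 3.0, hp = 110))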
Q. Explain dimension reduction techniques with example
1) Dimension reduction techniques are used to reduce the number of features (or dimensions) in a dataset while
preserving as much information as possible. This can improve model performance, reduce computational cost, and
make data visualization easier. 2) Importance of Dimension Reduction - a) Overcoming the Curse of
Dimensionality: In high-dimensional datasets, the number of samples required to make reliable predictions
increases exponentially. Reducing dimensions helps manage this complexity. b) Improving Model Performance:
High-dimensional data often contains noise or redundant information. Reducing dimensions can improve the
accuracy and efficiency of machine learning models. c) Reducing Overfitting: By eliminating irrelevant or redundant
features, models become simpler and less prone to overfitting. d) Improving Computational Efficiency: Lower
dimensions reduce the time and memory requirements for processing and training models. e) Better Visualization:
Data with fewer dimensions (2D or 3D) can be visualized more easily, helping in understanding the underlying
patterns. 3) Types of Dimension Reduction: a) Feature Extraction: Derives new features that summarize the
original features. b) Feature Selection: Identifies and retains only the most important original features.
4) Common Techniques: a) Principal Component Analysis (PCA): Projects data into a lower-dimensional space
using orthogonal transformations. Example: Reducing a dataset with 10 features to 2 principal components while
retaining maximum variance. Application: Image compression, genetics studies. b) Linear Discriminant Analysis
(LDA): Focuses on maximizing the separation between classes in labeled data. Example: Used in classification
problems to reduce dimensionality and improve accuracy. Application: Face recognition, spam detection.
c) t-Distributed Stochastic Neighbor Embedding (t-SNE): Visualizes high-dimensional data by projecting it into 2D or
3D space. Example: Visualizing clusters in customer segmentation data. Application: Exploratory data analysis,
clustering. d) Autoencoders (Deep Learning): Neural networks that compress data into a latent space and
reconstruct it. Example: Reducing noise in image datasets. Application: Anomaly detection, data denoising.
e) Feature Selection Methods: Filter Methods: Use statistical tests like correlation. Wrapper Methods: Use
algorithms like recursive feature elimination (RFE). Example: Selecting top 5 most relevant variables for predicting
house prices. Application: Predictive modeling.
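A hedged sketch of one of these techniques (PCA) in R using prcomp() on the built-in iris measurements; keeping the first two components is an assumption made only for this example:
# Standardize the four numeric iris measurements and run PCA
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
# Proportion of variance explained by each principal component
summary(pca)
# First two principal components as a reduced 2-D representation of the data
reduced <- pca$x[, 1:2]
head(reduced)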
Q. Discuss techniques of performance evaluation of logistic regression model.
a) Confusion Matrix: A table summarizing True Positives (TP), True Negatives (TN), False Positives (FP), and False
Negatives (FN). Provides insights into the model's accuracy, precision, recall, and F1-score. Helps understand how
well the model is classifying each category. b) Accuracy: Measures the proportion of correctly predicted instances:
Accuracy = (TP + TN)/(TP + TN + FP + FN). Limitation: Misleading for imbalanced datasets. Measures overall correctness by
dividing correctly predicted instances by total instances. c) Precision (Positive Predictive Value): Proportion of true
positives among all predicted positives: Precision=TP/(TP+FP). Useful when false positives are costly (e.g., spam
detection). Indicates how many of the predicted positive cases are actually positive. d) Recall (Sensitivity):
Proportion of true positives among actual positives: Recall=TP/(TP+FN). Useful when false negatives are critical
(e.g., disease detection). Indicates how many actual positive cases were correctly identified. e) F1-Score - Balances
precision and recall. Useful when there's an uneven class distribution or when precision and recall need balancing.
Formula: F1 = 2*(Precision * Recall)/(Precision + Recall). f) ROC-AUC Curve (Receiver Operating Characteristic): Plots
True Positive Rate (TPR) vs. False Positive Rate (FPR). AUC (Area Under the Curve) evaluates model's ability to
distinguish between classes. A higher AUC indicates better model performance. g) Log Loss (Logarithmic Loss):
Measures the uncertainty of predictions. Lower log loss indicates better performance. Lower values indicate better
calibration of predicted probabilities. h) Threshold Tuning: Adjust the decision threshold to optimize precision,
recall, or other metrics based on business needs. Example: Raising the threshold reduces false positives but may
lower recall. i) Cross-Validation: Splits data into training and testing sets multiple times to assess model stability and
performance. Reduces overfitting and ensures reliability. j) Classification Report: Provides a detailed breakdown of
precision, recall, F1-score, and support (number of true instances for each class). Offers a quick summary for
decision-making
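A hedged sketch of two of these checks (log loss and threshold tuning) in R, fitting a logistic model on the built-in mtcars data; the formula am ~ mpg + wt and the candidate thresholds are assumptions for illustration:
# Fit a logistic regression: transmission type predicted by mpg and weight
fit <- glm(am ~ mpg + wt, data = mtcars, family = binomial)
probs <- predict(fit, type = "response")
y <- mtcars$am
# Log loss: lower values indicate better-calibrated probabilities
log_loss <- -mean(y * log(probs) + (1 - y) * log(1 - probs))
log_loss
# Threshold tuning: compare accuracy at two different cutoffs
for (threshold in c(0.3, 0.5)) {
  pred <- as.numeric(probs > threshold)
  cat("Threshold:", threshold, "Accuracy:", mean(pred == y), "\n")
}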
Q.Explain the concept of Time Series Analysis. Discuss how time series analysis is used in business
forecasting.
Time Series Analysis is a method used to analyze and interpret data points collected over time, aiming to identify
patterns, trends, and relationships for forecasting and decision-making. It provides valuable insights into past
behaviors and equips businesses to predict future outcomes efficiently. a) Trend: Trend represents the long-term
movement or directionality of the data over time. It captures the overall tendency of the series to increase,
decrease, or remain stable. Trends can be linear, indicating a consistent increase or decrease, or nonlinear, showing
more complex patterns. Trends often result from underlying factors like market growth, technological advancements,
or population changes. Visualization: Line plots help to visually identify trends. To capture trends: Apply smoothing
techniques or regression models. b) Seasonality: Seasonality refers to periodic fluctuations or patterns that occur at
regular intervals within the time series. These cycles often repeat annually, quarterly, monthly, or weekly and are
typically influenced by factors such as seasons, holidays, or business cycles. Line plot used to visualize seasonality.
To capture seasonality: Use models like SARIMA (Seasonal ARIMA) or Exponential Smoothing. c) Cyclic variations:
Cyclical variations are longer-term fluctuations in the time series that do not have a fixed period like seasonality.
These fluctuations represent economic or business cycles, which can extend over multiple years and are often
associated with expansions and contractions in economic activity. d) Irregularity (or Noise): Irregularity, also known
as noise or randomness, refers to the unpredictable or random fluctuations in the data that cannot be attributed to
the trend, seasonality, or cyclical variations. These fluctuations may result from random events, measurement
errors, or other unforeseen factors. Irregularity makes it challenging to identify and model the underlying patterns
in the time series data. Steps in Time Series Analysis: a) Data Collection and Preprocessing: Gather time-stamped
data. Handle missing values and ensure data quality. b) Visualization: Use line plots to visualize trends and patterns.
Example: plot(ts_data) in R to create a time series plot. c) Decomposition: Break down the series into its
components (trend, seasonality, and residual). In R: decompose() or stl() functions.
d) Stationarity Check: Ensure the series has constant mean and variance over time. Test: Augmented Dickey-Fuller
(ADF) Test. e) Modeling: Choose a suitable model based on the data characteristics: ARIMA (AutoRegressive
Integrated Moving Average): For non-seasonal data. SARIMA (Seasonal ARIMA): For data with seasonality.
Exponential Smoothing: For smoothing and forecasting. f) Forecasting: Use the fitted model to predict future
values. In R: Use forecast::forecast() for prediction. Use of Time Series Analysis in Business Forecasting: Sales
Forecasting, Demand Forecasting, Financial Analysis, Workforce Planning, Risk Management.
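A hedged sketch of these steps in R on the built-in AirPassengers series; the forecast package is assumed to be installed, and auto.arima() is used here only as one reasonable modelling choice:
library(forecast)              # assumed installed: install.packages("forecast")
ts_data <- AirPassengers       # built-in monthly airline passenger series
plot(ts_data)                  # visualize trend and seasonality
decomp <- decompose(ts_data)   # split into trend, seasonal, and random components
plot(decomp)
model <- auto.arima(ts_data)   # fit a seasonal ARIMA model chosen automatically
future <- forecast(model, h = 12)   # forecast the next 12 months
plot(future)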
Q. Describe linear Discriminant analysis (LDA). Write a brief outline of R code for the same
1) Linear Discriminant Analysis (LDA) is a dimensionality reduction and classification technique used to separate
classes by finding a linear combination of features that best separates them. It works by maximizing the ratio of
between-class variance to within-class variance to ensure distinct class separation. 2) To use lda() function, one
must install the following packages: a) MASS package for lda() function. b) tidyverse package for better and easy
data manipulation and visualization. c) caret package for a better machine learning workflow. 3) Features of LDA
a) Dimensionality Reduction: Projects data into a lower-dimensional space while preserving class separability. b)
Supervised Learning: Requires labeled data for training. 4) Applications: Image recognition (e.g., face recognition),
Text classification, Disease diagnosis based on medical data. 5) Steps in LDA: a) Compute the mean for each class
and the overall mean. b) Calculate the within-class scatter matrix and between-class scatter matrix. c) Solve for
eigenvalues and eigenvectors to compute the transformation matrix. d) Project data into the new space for
classification or visualization. 6) Explanation of Key Steps in R Code:
# Load necessary libraries
library(MASS)
#Load or create dataset
data <- iris
head(data)
# Split data into training and test sets
set.seed(123)
train_index <- sample(1:nrow(data), size = 0.7 * nrow(data))
train_data <- data[train_index, ]
test_data <- data[-train_index, ]
# Perform LDA
lda_model <- lda(Species ~ ., data = train_data)
print(lda_model)
# Predict on test data
predictions <- predict(lda_model, test_data)
head(predictions$class)
# Evaluate the model
confusion_matrix <-table(predictions$class, test_data$Species)
print(confusion_matrix)
# Visualize LDA results
plot(lda_model, col = as.numeric(train_data$Species))
Applications of LDA - a) Customer Segmentation - Classify customers into distinct groups based on their purchasing
behavior and demographics. b) Churn Prediction - Predict whether a customer will leave (churn) or stay with a service
or product. c) Customer Sentiment Analysis - Classify customer reviews or feedback as positive, negative, or
neutral. d) Targeted Advertising - Classify users based on their likelihood to engage with specific ads or offers.
e) Product Recommendations - Predict the type of products a customer is most likely to purchase based on past
behavior and preferences.

Q. Examine ANOVA in R? State the assumptions and explain One way ANOVA & Two Way ANOVA in
detail. Also state the benefits and limitations of ANOVA.
1) ANOVA, or Analysis of Variance is a parametric statistical technique that helps in finding out if there is a
significant difference between the mean of three or more groups. It checks the impact of various factors by
comparing groups (samples) based on their respective mean. ANOVA tests the null hypothesis that all group means
are equal, against the alternative hypothesis that at least one group mean is different. 2) Assumptions - a) The
dependent variable is approximately normally distributed within each group. This assumption is more critical for
smaller sample sizes. b) The samples are selected at random and should be independent of one another.
c) All groups have equal standard deviations.d) Each data point should belong to one and only one group. There
should be no overlap or sharing of data points between groups. 3) Benefits of ANOVA: a) Allows comparison of
more than two groups simultaneously. b) Identifies whether significant differences exist without requiring multiple
t-tests. c) Reduces the risk of Type I error (false positives). 4) Limitations of ANOVA: a) Sensitive to violations of
assumptions (e.g., normality, homogeneity). b) Does not indicate which groups differ (requires post-hoc tests).
c) Limited to numerical dependent variables and categorical independent variables.
5) One way Anova - One-way ANOVA: This is the most basic form of ANOVA and is used when there is only one
independent variable with more than two levels or groups. It assesses whether there are any statistically significant
differences among the means of the groups. Example with Procedure for One way Anova: Imagine a researcher
wants to test the effectiveness of three different fertilizers (A, B, and C) on plant growth. They apply each fertilizer
to a separate group of plants and measure the height of the plants after a certain period. a) Hypotheses: Null
Hypothesis (H₀): There is no significant difference in the mean plant height between the three fertilizer groups.
Alternative Hypothesis (H₁): At least one group's mean plant height is significantly different from the others. b)
Data Collection: Collect data on plant height for each fertilizer group. c) ANOVA Test: Perform the ANOVA test to
calculate the F-statistic and p-value. d) Decision Making: If the p-value is less than the significance level (e.g., 0.05),
reject the null hypothesis. This indicates that at least one group's mean is significantly different from the others. If
the p-value is greater than the significance level, fail to reject the null hypothesis. This suggests that there is no
significant difference between the group means. e) Post-hoc Tests: If the ANOVA test shows a significant difference,
post-hoc tests like Tukey's HSD or Bonferroni's test can be used to determine which specific groups differ
significantly from each other. By using one-way ANOVA, the researcher can determine if there is a significant
difference in plant growth due to the different fertilizers and identify which fertilizer is most effective.
6) Two Way Anova: Two-Way ANOVA is a statistical test used to determine the effect of two independent
categorical variables (factors) on a continuous dependent variable, and to explore whether there is an interaction
between these factors. Example with procedure in Two Way ANOVA: A researcher tests the effects of three
fertilizers (A, B, C) and two watering frequencies (daily, every other day) on plant height. a) Hypothesis - Null
Hypothesis - Fertilizer type: Mean plant heights are the same for all fertilizers, Watering frequency: Mean plant
heights are the same for both watering schedules, Interaction: No interaction exists between fertilizer type and
watering frequency. Alternative Hypotheses: Fertilizer type: At least one fertilizer produces a different mean
height. Watering frequency: At least one watering schedule produces a different mean height. Interaction: The
effect of fertilizer type on height depends on watering frequency.b) Data Collection: Measure plant heights for all
fertilizer–watering combinations. c) ANOVA Test: Calculate F-statistics and p-values for: Fertilizer (Factor 1),
Watering frequency (Factor 2), Interaction (Fertilizer × Watering), d) Comparison: Reject null hypotheses if p-values
< 0.05. e) Decision: Significant results indicate differences in group means or interaction effects.
f) Post-Hoc Tests: Perform Tukey’s HSD to identify specific group differences if significant.
7) Applications of ANOVA - a) Business & Marketing - Comparing the effectiveness of three marketing campaigns.
b) Education - Comparing student performance under three different teaching methods. c) Healthcare - Comparing
patient recovery times under three different treatments.
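A hedged sketch of both tests in R with aov(); the plant-growth data frame below is invented purely to mirror the fertilizer and watering scenario above:
# Illustrative data: plant height by fertilizer and watering frequency
plants <- data.frame(
  height = c(20, 22, 19, 25, 27, 26, 30, 31, 29, 21, 23, 24),
  fertilizer = rep(c("A", "B", "C"), each = 4),
  watering = rep(c("daily", "alternate"), times = 6)
)
# One-way ANOVA: does mean height differ by fertilizer?
one_way <- aov(height ~ fertilizer, data = plants)
summary(one_way)
# Two-way ANOVA with an interaction between fertilizer and watering
two_way <- aov(height ~ fertilizer * watering, data = plants)
summary(two_way)
# Post-hoc comparison of fertilizer groups
TukeyHSD(one_way)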
Q. Describe descriptive analytics in R. Explain any three functions of descriptive analytics in R.
1) Descriptive analytics is the process of summarizing and analyzing historical data to uncover patterns, trends, and
insights. In R, descriptive analytics is widely used to compute statistical summaries, visualize data, and understand
its distribution. 2) R provides powerful tools and functions to perform descriptive analytics efficiently, helping
analysts make data-driven decisions. 3) Key Functions for Descriptive Analytics in R a) summary(): Purpose:
Provides a quick summary of an object (e.g., data frame, vector). Output: For numeric data: Minimum, 1st Quartile,
Median, Mean, 3rd Quartile, and Maximum. For categorical data: Count of each level. Example:
data <- c(10, 20, 30, 40, 50)
summary(data)
Output:
Min. 1st Qu. Median Mean 3rd Qu. Max.
10 20 30 30 40 50
b) mean() and sd() : Purpose: Compute the mean (average) and standard deviation of numeric data. Use: mean()
calculates the central tendency of data. sd() measures the spread or dispersion of data. Example:
data <- c(10, 20, 30, 40, 50)
mean(data) # Output: 30
sd(data) # Output: 15.81139
c) table(): Purpose: Creates frequency tables for categorical data. Use: Understand the distribution of categories in
a dataset. Example:
category <- c("A", "B", "A", "C", "A", "B", "C")
table(category)
Output:
category
A B C
3 2 2
Benefits of Descriptive analytics in R - a) Comprehensive Data Summarization - R allows quick computation of key
statistical measures like mean, median, variance, and standard deviation, providing a clear understanding of data
distribution and trends. b) Easy Data Visualization - With built-in packages like ggplot2, lattice, and base R plots, R
offers a wide range of visualizations (e.g., histograms, bar charts, scatter plots), helping users better understand
data patterns and relationships. c) Handling Large Datasets - R efficiently handles large datasets using functions like
data.table and dplyr, enabling fast and memory-efficient data manipulation and summarization. d)
Customizability - R's flexibility allows users to customize statistical summaries, plots, and tables to meet specific
analytical needs, making it suitable for diverse applications. e) Rich Library of Functions - R has numerous built-in
functions and packages (e.g., summary(), psych) that streamline descriptive analytics, making complex tasks easy.
Q.22) Explain Z test of hypothesis testing. Write syntax and explain in detail?
A Z-Test is a statistical method used to determine whether there is a significant difference between sample data
and a population parameter (mean or proportion) or between two samples. It is applicable when the sample size is
large (n>30) or when the population variance is known. 1) When to Use a Z-Test: a) For Population Mean: Used
to compare a sample mean with a known population mean (μ). b) For Population Proportion: Used to compare a
sample proportion (p) with a population proportion (P0). c) Between Two Samples: Used to test the difference
between two sample means or proportions. d) Conditions for Z-Test: Data is normally distributed, or sample size is
large (n>30). Population variance (σ2) is known. 2) Hypotheses in Z-Test : a) Null Hypothesis : Assumes no effect
or difference (e.g., μ=μ0). b) Alternative Hypothesis : Assumes an effect or difference (e.g., μ≠μ0) 3) Formula for Z
Test
a) For a Single Sample Mean: Z = (x̄ − μ) / (σ / √n), where x̄: Sample mean, μ: Population mean, σ: Population
standard deviation, n: Sample size.
b) For a Single Sample Proportion: Z = (p − p0) / √(p0(1 − p0) / n), where p: Sample proportion, p0: Population
proportion, n: Sample size.
c) For Two Sample Means: Z = ((x̄1 − x̄2) − (μ1 − μ2)) / √(σ1²/n1 + σ2²/n2), where x̄1, x̄2: Sample means,
μ1, μ2: Population means, σ1, σ2: Population standard deviations, n1, n2: Sample sizes.
d) For Two Sample Proportions: Z = (p1 − p2) / √(P(1 − P)(1/n1 + 1/n2)), where p1, p2: Sample proportions,
P: Pooled proportion, n1, n2: Sample sizes.

Steps in Performing a Z-Test : 1) Define Hypotheses: H0: Null hypothesis, Ha: Alternative hypothesis 2)
Determine the Level of Significance (α): Common values: 0.05, 0.01 3) Compute the Z-Statistic: Use the
appropriate formula. 4) Determine the Critical Value: Use a Z-table to find the critical value corresponding to α.
5) Make a Decision: If ∣Z∣>Zcritical : Reject H0, If ∣Z ∣≤Zcritical: Fail to reject H0 6) Draw Conclusion - Based on
the decision, interpret the results in the context of the problem. Syntax - # Example: Single sample Z-test
z_test <- function(sample_mean, population_mean, population_sd, n) {
z_stat <- (sample_mean - population_mean) / (population_sd / sqrt(n))
p_value <- 2 * (1 - pnorm(abs(z_stat)))
return(list(Z_Statistic = z_stat, P_Value = p_value)) }
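A quick usage sketch of the function above; the sample values are invented for illustration:
# Test whether a sample mean of 105 (n = 50, population sd = 15) differs from a population mean of 100
result <- z_test(sample_mean = 105, population_mean = 100, population_sd = 15, n = 50)
result$Z_Statistic   # about 2.36
result$P_Value       # about 0.018, so significant at the 0.05 level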
Q.17) Explain t-test of hypothesis testing. Write syntax in detail.
The t-test is a statistical hypothesis test that determines whether the means of one or two groups are significantly
different from a known value or each other. It uses the t-distribution, which is a family of probability distributions
suitable for small sample sizes. Why Use a T-Test? - To determine if a treatment or intervention has an effect, To
compare two datasets (e.g., control vs. experimental groups), To assess relationships in paired or dependent data
(e.g., pre- and post-measurements). Key Concepts in T-Test 1) Hypotheses a) Null Hypothesis (H₀): Assumes no
significant difference (e.g., group means are equal). b) Alternative Hypothesis (H₁): Assumes a significant difference
(e.g., group means are not equal, one mean is greater/lesser than the other). 2) Types of T-Tests a) One-
Sample T-Test: Compares the mean of a single sample to a known value (e.g., population mean). Example: Testing if
the average IQ of a class differs from the population average of 100. b) Two-Sample (Independent) T-Test:
Compares the means of two independent groups. Example: Testing if men and women have different average
heights. c) Paired T-Test: Compares means of related groups or matched pairs (e.g., before-and-after observations).
Example: Testing the effect of a drug by comparing blood pressure before and after treatment. 3) Assumptions
a) Data is continuous (interval or ratio scale). b) Data is approximately normally distributed (especially important for
small samples). c) For the two-sample t-test: The two groups should have similar variances (use Welch’s t-test if this
is not true). Observations between the groups must be independent. d) For paired t-tests, the differences between
paired observations should be normally distributed. 4) Degrees of Freedom (df) Represents the number of values
that can vary independently in a calculation. a) For one-sample t-tests: df=n−1 (where n is the sample size). b) For
two-sample t-tests: df=n1+n2−2 (where n1 and n2 are sample sizes of the two groups). c) For paired t-tests: df=n−1
(based on paired differences). 5) T-Statistic - Measures how many standard deviations the sample mean is from the
population mean (or the difference between sample means relative to variability). Higher absolute values of the t-
statistic indicate greater differences. Formula: t = (Observed Difference in Means−Hypothesized Difference) /
Standard error of the mean. 6) Syntax -
a) One-Sample Test, b) Two-Sample Test, c) Paired T-test (a sketch of each call is shown below).
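A hedged sketch of the standard t.test() calls for each case; the vectors x and y and the hypothesized mean of 100 are placeholders:
x <- c(98, 102, 105, 97, 101)   # placeholder sample
y <- c(95, 99, 100, 96, 98)     # placeholder second sample / post-treatment values
# a) One-sample t-test against a hypothesized mean of 100
t.test(x, mu = 100)
# b) Two-sample (independent) t-test; Welch's correction is the default in R
t.test(x, y)
# c) Paired t-test for matched before/after observations
t.test(x, y, paired = TRUE)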

Q.19) What is confusion matrix? How confusion matrix can be used to evaluate the accuracy of the model?
1) A confusion matrix is a table that summarizes the performance of a classification model by comparing the
predicted labels with the actual labels. It is particularly useful for evaluating the performance of classification
models because it provides detailed insights into how well the model distinguishes between classes. 2) Structure of
Confusion Matrix For a binary classification problem, the confusion matrix is a 2x2 table with the following
components: True Positive (TP): Correctly predicted positive instances (model predicts "Yes" when the actual is
"Yes"). True Negative (TN): Correctly predicted negative instances (model predicts "No" when the actual is "No").
False Positive (FP): Incorrectly predicted positive instances (model predicts "Yes" when the actual is "No"). Also
known as a Type I error. False Negative (FN): Incorrectly predicted negative instances (model predicts "No" when
the actual is "Yes"). Also known as a Type II
error. 3) How Confusion Matrix Evaluates Model Accuracy - The confusion matrix allows us to compute various
performance metrics to evaluate the accuracy and reliability of the model. a) Accuracy Definition: The proportion of
correct predictions out of the total predictions. Formula: Accuracy = (TP + TN)/(TP + TN + FP + FN). Use: Indicates the
overall correctness of the model. b) Precision Definition: The proportion of correctly predicted positive observations
to the total predicted positives. Formula: Precision = TP/(TP + FP). Use: Evaluates the relevance of the model's
positive predictions. c) Recall ( True Positive Rate) Definition: The proportion of actual positive observations
correctly identified. Formula: Recall = TP/(TP + FN). Use: Indicates how well the model identifies true positives. d)
Specificity Definition: The proportion of actual negatives correctly identified. Formula : TN/(TN + FP). Use: Indicates
how well the model identifies true negatives. e) F1-Score Definition: Harmonic mean of Precision and Recall,
balancing both metrics. Formula: F1 Score = 2*(Precision * Recall)/(Precision + Recall). Use: Helpful when the
dataset is imbalanced. Importance in Model Evaluation - Identify how the model performs on each class.
Understand trade-offs between precision and recall. Evaluate model performance on imbalanced data using
precision and the F1-score.
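A hedged sketch of computing these metrics in R; the actual and predicted vectors below are invented for illustration:
actual <- factor(c(1, 0, 1, 1, 0, 0, 1, 0, 1, 1))
predicted <- factor(c(1, 0, 1, 0, 0, 1, 1, 0, 1, 1))
# 2x2 confusion matrix
cm <- table(Predicted = predicted, Actual = actual)
cm
TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["1", "0"]; FN <- cm["0", "1"]
accuracy <- (TP + TN) / (TP + TN + FP + FN)
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)
specificity <- TN / (TN + FP)
f1 <- 2 * precision * recall / (precision + recall)
c(Accuracy = accuracy, Precision = precision, Recall = recall, Specificity = specificity, F1 = f1)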
Q.20) Discuss the concept of odds and probabilities in Logistic regression. How ROC curve can be used to
evaluate the accuracy of Logistic model ?
Concept of Odds and Probabilities in Logistic Regression In Logistic Regression, the output is a probability that a
given input belongs to a certain class. This probability is transformed into odds and modeled using the logit
function. Logistic regression is widely used for binary classification tasks. 1) Probability: Probability is a measure of
how likely an event is to occur. For binary classification, P(y=1)is the probability of the event occurring, and 1−P(y=1)
is the probability of it not occurring. 2) Odds: Odds represent the ratio of the probability of an event occurring to
the probability of it not occurring: Odds = P(y=1)/(1 − P(y=1)). 3) Logit Function: Logistic regression uses the log-odds
(logarithm of odds) as the dependent variable, making it linear: logit(P) = ln(P(y=1)/(1 − P(y=1))). The logistic
regression model is expressed as: Logit(P)=β0+β1x1+β2x2+⋯+βkxk. ROC Curve for Logistic Regression Model
Evaluation : The Receiver Operating Characteristic (ROC) curve is a graphical representation used to evaluate the
performance of a classification model. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at
different classification thresholds. 1) Key Terms a) True Positive Rate (TPR) / Sensitivity: Measures the proportion
of actual positives correctly identified. formula : TPR = TP/(TP + FN) b) False Positive Rate (FPR): Measures the
proportion of actual negatives incorrectly classified as positive. Formula: FPR = FP/(FP + TN) 2) Interpreting the ROC
Curve: The x-axis represents the FPR, The y-axis represents the TPR, The curve shows how the TPR and FPR vary as
the decision threshold changes. 3) Area Under the Curve (AUC): AUC quantifies the overall ability of the model to
distinguish between classes. AUC = 1: Perfect model, AUC = 0.5: No discrimination (random guessing), Higher AUC
values indicate better model performance. 4) Advantages of the ROC Curve : Evaluates the model's performance
across all possible thresholds. Works well for imbalanced datasets by focusing on the trade-off between sensitivity
and specificity.
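A hedged sketch of converting predicted probabilities to odds and drawing an ROC curve in R; the pROC package is assumed to be installed, and the logistic model on mtcars is chosen only for illustration:
library(pROC)   # assumed installed: install.packages("pROC")
fit <- glm(am ~ mpg + wt, data = mtcars, family = binomial)
probs <- predict(fit, type = "response")
# Convert predicted probabilities to odds and log-odds
odds <- probs / (1 - probs)
log_odds <- log(odds)
head(cbind(probability = probs, odds = odds, log_odds = log_odds))
# ROC curve and AUC: a higher AUC means better separation between classes
roc_obj <- roc(mtcars$am, probs)
plot(roc_obj)
auc(roc_obj)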

Q.23) What are outliers? How outliers can be detected using statistical methods?
An outlier is a data point that significantly deviates from the rest of the dataset. It may be unusually high or low
compared to the majority of values. Outliers can result from variability in data, measurement errors, or
experimental errors. 1) Why Detect Outliers? a) Impact on Analysis: Outliers can distort statistical summaries
(e.g., mean, standard deviation) and affect the results of machine learning models. b) Understanding Data:
Identifying outliers helps in understanding the dataset's variability or discovering anomalies. c) Data Quality: They
may indicate data entry errors or measurement issues that need correction. 2) Methods for Outlier Detection -
a) Z-Score Method - i) Measures how many standard deviations a data point is from the mean. ii) Data points with
∣Z∣>3 are often considered outliers. iii) Formula - Z = (x - μ)/ σ. iv) Syntax -
scores <- c(50, 52, 53, 55, 56, 95)
mean_score <- mean(scores)
std_dev <- sd(scores)
z_scores <- (scores - mean_score) / std_dev
outliers_z <- scores[abs(z_scores) > 3]
cat("Z-Scores:", z_scores, "\n")
cat("Outliers (Z-Score Method):", outliers_z, "\n")
b) Interquartile Range (IQR) Method: i) Identifies outliers based on the dataset's quartiles. ii) formula: IQR=Q3−Q1
iii) Lower Bound: Q1−1.5×IQR, Upper Bound: Q3+1.5×IQR Data points outside these bounds are outliers. iv) Syntax
scores <- c(50, 52, 53, 55, 56, 95)
Q1 <- quantile(scores, 0.25)
Q3 <- quantile(scores, 0.75)
IQR_value <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value
outliers_iqr <- scores[scores < lower_bound | scores > upper_bound]
cat("Lower Bound:", lower_bound, "\n")
cat("Upper Bound:", upper_bound, "\n")
cat("Outliers (IQR Method):", outliers_iqr, "\n")
c) Boxplot - i) A visual method for detecting outliers using the IQR. ii) Points outside the "whiskers" of the boxplot
are outliers. iii) Syntax
scores <- c(50, 52, 53, 55, 56, 95)
boxplot(scores, main = "Boxplot for Outlier Detection", ylab = "Scores")
outliers_boxplot <- boxplot(scores, plot = FALSE)$out
cat("Outliers (Boxplot Method):", outliers_boxplot, "\n")
d) Modified Z-Score - i) A robust alternative using the median. ii) Formula: Z = 0.6745 * (X − Median) / MAD, where
MAD is Median Absolute Deviation. iii) Threshold: ∣Z∣>3.5 iv) Syntax
scores <- c(50, 52, 53, 55, 56, 95)
median_score <- median(scores)
mad_value <- mad(scores, constant = 1)  # raw median absolute deviation (mad() otherwise applies a 1.4826 scaling)
modified_z_scores <- 0.6745 * (scores - median_score) / mad_value
outliers_modified_z <- scores[abs(modified_z_scores) > 3.5]
cat("Modified Z-Scores:", modified_z_scores, "\n")
cat("Outliers (Modified Z-Score Method):", outliers_modified_z, "\n")
e) Statistical Tests: i) Grubbs' Test: Identifies one outlier in a univariate dataset. ii) Dixon's Q-Test: Used for small
datasets to detect a single outlier iii) Syntax:
library(outliers)
data <- c(10, 12, 14, 13, 12, 500, 15)
grubbs.test(data)
