R Viva Ques

The document provides an overview of R programming, its advantages, and essential functionalities for data analysis, including package installation, data structures, and control flows. It covers descriptive statistics, predictive analytics, and practical coding examples in R for statistical computations and visualizations. Additionally, it explains key concepts like regression analysis, correlation, and the use of various statistical functions in R.

Uploaded by

Eshika Upadhyay

UNIT 3: Getting Started with R (6 Hours)

Introduction to R, Advantages, Installing Packages, Importing Data, Commands & Syntax, Packages & Libraries, Data Structures, Control Flows, Loops, Functions, Apply family.

1. What is R and why is it used?


Answer:​
R is a free, open-source programming language and software environment used for statistical
computing, data analysis, and visualization. It is widely used in data science, machine learning,
and academia.

2. What are the main advantages of R?


Answer:

●​ Open source and free to use​

●​ Rich set of packages for data analysis​

●​ Great visualization capabilities (e.g., ggplot2)​

●​ Large and active community​

●​ Widely used in academia and industry​

3. How do you install and load a package in R?


Answer:

R
install.packages("ggplot2") # To install
library(ggplot2) # To load

4. How can you import data from a CSV file in R?


Answer:
R
data <- read.csv("filename.csv")

5. What is the difference between a vector and a list in R?


Answer:

●​ Vector: Homogeneous data type (e.g., all numeric)​

●​ List: Heterogeneous, can contain different types (e.g., numbers, strings, vectors)​
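The difference is easiest to see when types are mixed. A minimal sketch (the variable names mixed_vec and mixed_lst are illustrative, not from the notes):

```r
# A vector is homogeneous: mixing types coerces every element to one common type.
mixed_vec <- c(1, "two", TRUE)
class(mixed_vec)           # "character"

# A list keeps each element's own type, and can even hold whole vectors.
mixed_lst <- list(1, "two", TRUE, c(90, 80, 85))
sapply(mixed_lst, class)   # "numeric" "character" "logical" "numeric"
```

This is why a list, not a vector, is the right container for a record such as name, age, and scores.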

6. What are data frames and how are they different from matrices?
Answer:

●​ Data Frame: Table-like structure with columns of different types​

●​ Matrix: 2D structure with elements of the same type​

7. How do you create a vector in R?


Answer:

R
vec <- c(1, 2, 3, 4)

8. How do you create a matrix in R?


Answer:

R
mat <- matrix(1:6, nrow=2, ncol=3)

9. How do you create a list in R?


Answer:

R
lst <- list(name="Alex", age=25, scores=c(90, 80, 85))

10. What are factors in R? Why are they used?


Answer:​
Factors are used to represent categorical data. They store both the values and the levels.

R
factor_var <- factor(c("low", "medium", "high"))
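A point examiners often probe: by default the levels are sorted alphabetically, which is usually not the intended order for ordinal data. A short sketch, assuming an illustrative vector named sizes:

```r
sizes <- c("low", "medium", "high", "low")

# Default behaviour: levels are sorted alphabetically.
f_default <- factor(sizes)
levels(f_default)              # "high" "low" "medium"

# For ordinal data, state the level order explicitly and mark the factor as ordered.
f_ordered <- factor(sizes, levels = c("low", "medium", "high"), ordered = TRUE)
levels(f_ordered)              # "low" "medium" "high"
f_ordered[1] < f_ordered[3]    # TRUE, because "low" comes before "high"
table(f_ordered)               # counts per level, in the stated order
```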

11. What are conditionals in R? Give an example.


Answer:

R
x <- 5
if (x > 0) {
  print("Positive")
} else {
  print("Non-positive")
}

12. What are loops in R? Give an example of a for loop.


Answer:

R
for (i in 1:5) {
  print(i^2)
}

13. How do you define a function in R?


Answer:

R
square <- function(x) {
  return(x^2)
}
square(4)

14. What is the purpose of the apply() function?


Answer:​
apply() is used to apply a function over the rows or columns of a matrix or data frame.

R
apply(matrix(1:9, nrow=3), 1, sum) # Row-wise sum

15. What is the difference between lapply() and sapply()?


Answer:

●​ lapply() always returns a list​

●​ sapply() tries to simplify the result into a vector or matrix​

R
x <- list(a=1:3, b=4:6)
lapply(x, sum) # List of sums
sapply(x, sum) # Vector of sums

16. What is the tapply() function used for?


Answer:​
tapply() applies a function over subsets of a vector, defined by a factor or grouping variable.

R
ages <- c(21, 25, 19, 23)
gender <- factor(c("M", "F", "M", "F"))
tapply(ages, gender, mean) # Mean age by gender

17. What is a library in R?


Answer:​
A library in R is the directory where installed packages are stored. We load a package from the library into the current session using the library() function.

18. How do you check the structure of a data object?


Answer:

R
str(data)

19. How do you check the summary statistics of a data frame?


Answer:

R
summary(data)

20. What are the common data types in R?


Answer:

●​ Numeric​

●​ Integer​

●​ Character​

●​ Logical​

●​ Complex​

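● Factor (strictly a class built on integer codes rather than a basic type, but commonly listed alongside them)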
UNIT 4: Descriptive Statistics Using R (6 Hours)
Topics: Data Import, Data Visualization, Measures of Central Tendency, Measures
of Dispersion, Covariance, Correlation, Coefficient of Determination

🔸 Section A: Theoretical Questions


1. What is descriptive statistics?
Answer:​
Descriptive statistics refers to methods for summarizing and organizing data. It includes:

●​ Measures of central tendency (mean, median, mode)​

●​ Measures of dispersion (range, variance, standard deviation)​

●​ Charts and graphs for data visualization​

2. What are measures of central tendency? Explain each with an example.


Answer:​
These describe the center of a data set:

●​ Mean: Arithmetic average​

●​ Median: Middle value when data is sorted​

●​ Mode: Most frequent value​


Example: For data = {2, 3, 4, 5, 5}​

●​ Mean = (2+3+4+5+5)/5 = 3.8​

●​ Median = 4​

●​ Mode = 5​
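The worked example can be checked in R. A minimal sketch using the same five values; the mode is read off a frequency table because base R has no function for the statistical mode:

```r
x <- c(2, 3, 4, 5, 5)

mean(x)      # 3.8
median(x)    # 4

# Most frequent value, taken from a frequency table of x.
freq <- table(x)
mode_value <- as.numeric(names(freq)[which.max(freq)])
mode_value   # 5
```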

3. What are measures of dispersion? Why are they important?


Answer:​
They measure how spread out the data is. Common ones include:

●​ Range: Difference between max and min​

● Variance: Average squared deviation from the mean (R's var() divides by n − 1, giving the sample variance)

●​ Standard Deviation: Square root of variance​

●​ IQR: Interquartile range (Q3 - Q1)​

They help understand data variability and consistency.
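As an illustration, here are the same measures for the values {2, 3, 4, 5, 5}. The sketch also checks that var() uses the n − 1 (sample) denominator; sample_var is an illustrative name:

```r
x <- c(2, 3, 4, 5, 5)

diff(range(x))   # 3   (max minus min; range() itself returns both ends)

# Sample variance by hand: squared deviations divided by n - 1.
n <- length(x)
sample_var <- sum((x - mean(x))^2) / (n - 1)
all.equal(sample_var, var(x))   # TRUE

var(x)   # 1.7
sd(x)    # square root of the variance
IQR(x)   # 2   (Q3 = 5, Q1 = 3 with R's default quantile method)
```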

4. What is covariance? What does its sign indicate?


Answer:​
Covariance measures the directional relationship between two variables:

●​ Positive covariance = variables move in same direction​

●​ Negative covariance = move in opposite direction​

● Near zero = weak or no linear relationship (the magnitude itself depends on the units of the variables)

5. What is correlation? How is it different from covariance?


Answer:​
Correlation standardizes covariance to a [-1, 1] scale:

● +1 = perfect positive linear relationship

● 0 = no linear relationship

● -1 = perfect negative linear relationship


Unlike covariance, correlation is unit-free.​
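A small sketch of both points, using made-up height and weight values (illustrative data, not from the notes). Correlation equals covariance divided by the product of the standard deviations, and changing units rescales covariance but leaves correlation unchanged:

```r
# Illustrative data only.
height_cm <- c(150, 160, 165, 172, 180)
weight_kg <- c(52, 58, 63, 70, 79)

cov_cm <- cov(height_cm, weight_kg)
cor_cm <- cor(height_cm, weight_kg)

# Correlation is covariance standardized by the two standard deviations.
all.equal(cor_cm, cov_cm / (sd(height_cm) * sd(weight_kg)))   # TRUE

# Convert centimetres to metres: covariance shrinks by a factor of 100,
# while correlation stays exactly the same.
height_m <- height_cm / 100
all.equal(cov(height_m, weight_kg), cov_cm / 100)   # TRUE
all.equal(cor(height_m, weight_kg), cor_cm)         # TRUE
```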

6. What is the coefficient of determination (R²)?


Answer:​
R² tells us how much of the variation in the dependent variable is explained by the independent
variable(s).
● Value ranges from 0 to 1 (for a least-squares model with an intercept)

●​ Closer to 1 = better model fit​
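To make the definition concrete, R² can be recomputed from the residuals. The sketch below uses the built-in mtcars dataset; the choice of mpg and wt is only an illustrative assumption:

```r
model <- lm(mpg ~ wt, data = mtcars)
r2 <- summary(model)$r.squared

# Definition: 1 - (unexplained variation / total variation).
sse <- sum(residuals(model)^2)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
all.equal(r2, 1 - sse / sst)                  # TRUE

# With a single predictor and an intercept, R-squared is the squared correlation.
all.equal(r2, cor(mtcars$wt, mtcars$mpg)^2)   # TRUE
```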

🔸 Section B: Practical Questions Using R


7. How do you import data from a CSV file in R?
R
data <- read.csv("datafile.csv")

8. How do you calculate mean, median, and mode in R?


R
mean(data$column)
median(data$column)
# Statistical mode: base R has no built-in for this (mode() returns the storage mode)
getmode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]  # if there are ties, returns the first most-frequent value
}
getmode(data$column)

9. How do you calculate variance and standard deviation in R?


R
var(data$column) # Variance
sd(data$column) # Standard deviation

10. How do you calculate the range and IQR in R?


R
range(data$column)
IQR(data$column)
11. How do you calculate covariance and correlation in R?
R
cov(data$x, data$y)
cor(data$x, data$y)

12. How do you calculate the coefficient of determination in R?


R
model <- lm(y ~ x, data=data)
summary(model)$r.squared

13. How do you create a histogram in R?


R
hist(data$column, main="Histogram", col="skyblue")

14. How do you create a bar chart in R?


R
barplot(table(data$category), main="Bar Chart", col="lightgreen")

15. How do you create a boxplot in R? What does it represent?


R
boxplot(data$column, main="Boxplot", col="orange")

Explanation: Shows median, quartiles, outliers, and spread of the data.

16. How do you create a scatter plot in R?


R
plot(data$x, data$y, main="Scatter Plot", xlab="X", ylab="Y")
17. How do you create a line graph in R?
R
plot(data$x, type="l", main="Line Graph", col="blue")

18. What is the use of the summary() function in R?


R
summary(data)

Explanation: Gives min, 1st quartile, median, mean, 3rd quartile, and max for each numeric column (and counts for factor columns).

19. How do you create a pairwise correlation matrix in R?


R
cor(data[, c("var1", "var2", "var3")])

20. How do you visualize correlation in R (advanced)?


R
library(corrplot)
corrplot(cor(data[,c("x", "y", "z")]), method="circle")
UNIT 5: Predictive Analytics Using R – Sample Viva
Questions & Answers (Excluding Textual Analytics)

🔸 SECTION A: Theoretical Questions


1. What is Predictive Analytics?
Answer:​
Predictive analytics uses statistical techniques (like regression models) to forecast future
outcomes based on historical data.

2. What is regression analysis?


Answer:​
Regression analysis is a statistical method used to examine the relationship between a
dependent variable and one or more independent variables.

3. What is the difference between simple and multiple linear regression?


●​ Simple Linear Regression: One independent variable → one dependent variable​

● Multiple Linear Regression: Two or more independent variables → one dependent variable

4. What are the assumptions of a linear regression model?


1.​ Linearity​

2.​ Independence of errors​

3.​ Homoscedasticity (constant variance of errors)​

4.​ Normality of errors​

5.​ No multicollinearity (in multiple regression)​


5. What does the R-squared value indicate?
Answer:​
R² (Coefficient of determination) tells us the proportion of variance in the dependent
variable that is predictable from the independent variables.

6. What is the adjusted R-squared?


Answer:​
Adjusted R² adjusts the R² value for the number of predictors, and is a better indicator when
comparing models with different numbers of variables.
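The adjustment follows the standard formula Adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. A sketch on the built-in mtcars data (the model and the added noise column are illustrative assumptions) checks the formula and shows why plain R² is not enough for comparing models:

```r
model <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(model)

n <- nrow(mtcars)               # number of observations
p <- length(coef(model)) - 1    # number of predictors, excluding the intercept
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)
all.equal(adj_manual, s$adj.r.squared)   # TRUE

# Add a predictor of pure random noise: plain R-squared can never go down,
# which is why it cannot be used on its own to compare models of different size.
set.seed(1)
noisy <- transform(mtcars, noise = rnorm(nrow(mtcars)))
s_noise <- summary(lm(mpg ~ wt + hp + noise, data = noisy))
s_noise$r.squared >= s$r.squared         # TRUE
```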

7. What is heteroscedasticity?
Answer:​
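It occurs when the variance of residuals is not constant across all levels of the independent
variable(s). It violates a key regression assumption: the OLS coefficient estimates remain unbiased
but become inefficient, and the usual standard errors (and hence t-tests and confidence intervals) become unreliable.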

8. What is multicollinearity?
Answer:​
Multicollinearity occurs when independent variables are highly correlated with each other,
making it difficult to isolate their individual effects.

9. How can you detect multicollinearity?


●​ Correlation matrix of predictors​

● Variance Inflation Factor (VIF): as a common rule of thumb, VIF > 10 indicates strong multicollinearity (some texts use a stricter cutoff of 5)
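For intuition, the VIF of a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on all the other predictors. A base-R sketch on mtcars (the three predictors are an illustrative choice); for purely numeric predictors these values should agree with what vif() from the car package reports:

```r
predictors <- mtcars[, c("wt", "hp", "disp")]

# For each predictor, regress it on the remaining ones and convert R-squared to a VIF.
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  aux <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(aux)$r.squared)
})

vif_manual   # one value per predictor; a VIF is never below 1
```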

10. What is the difference between confidence intervals and prediction intervals?

●​ Confidence interval: Range in which the mean value of the dependent variable is
expected to fall​
●​ Prediction interval: Range in which an individual prediction is expected to fall (wider)​

🔸 SECTION B: R Coding/Output-Based Questions


11. How do you run a simple linear regression in R?
R
model <- lm(y ~ x, data = dataset)
summary(model)

12. How do you run a multiple linear regression in R?


R
model <- lm(y ~ x1 + x2 + x3, data = dataset)
summary(model)

13. How do you extract the R-squared and Adjusted R-squared values in R?
R
summary(model)$r.squared
summary(model)$adj.r.squared

14. How do you interpret regression coefficients?


Answer:​
Each coefficient shows the change in the dependent variable for a 1-unit change in the
corresponding independent variable, keeping other variables constant.

15. How do you check for heteroscedasticity in R?


R
plot(model$fitted.values, model$residuals)
abline(h = 0, col = "red")
Or use:

R
library(lmtest)
bptest(model) # Breusch-Pagan Test

16. How do you detect multicollinearity in R?


R
library(car)
vif(model)

17. How do you calculate prediction and confidence intervals in R?


R
# Confidence interval
predict(model, newdata = data.frame(x = 50), interval = "confidence")

# Prediction interval
predict(model, newdata = data.frame(x = 50), interval = "prediction")

18. How do you visualize a regression line?


R
plot(dataset$x, dataset$y, main="Regression Line")
abline(model, col = "blue")

19. How do you check residual normality in R?


R
qqnorm(model$residuals)
qqline(model$residuals)

20. What is the summary() function used for in regression?


Answer:​
It gives the regression output:

●​ Coefficients​

●​ R-squared values​

●​ F-statistic​

●​ p-values for significance testing​

21. How do you interpret the p-value in regression output?


Answer:​
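If the p-value is below the chosen significance level (commonly 0.05), we reject the null hypothesis
that the coefficient is zero: the variable is statistically significant, meaning its association with
the dependent variable is unlikely to be due to chance alone.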
UNIT 3: Getting Started with R

1. What is R and why is it used in data analytics?


Answer:​
R is a programming language and software environment for statistical computing and
graphics.​
It is widely used in data science and analytics for:

●​ Data manipulation (via dplyr, data.table)​

●​ Statistical modeling (regression, hypothesis testing)​

●​ Data visualization (ggplot2, base R plots)​

●​ Machine learning (caret, randomForest)​

2. What are the advantages of R over other languages like Python or Excel?
Answer:

●​ Open source and free​

●​ Rich ecosystem of packages for specialized tasks​

●​ Great data visualization capabilities​

●​ Powerful for statistical modeling​

●​ Active community support​

●​ Integrates well with RStudio and markdown for reproducible reports​

3. What are packages and libraries in R?


Answer:

●​ A package is a collection of R functions, data, and documentation bundled together.​

●​ A library is where these packages are stored once installed.​


●​ You install a package using install.packages("package_name")​

●​ You load it into the current session using library(package_name)​

4. What are the primary data structures in R and where are they used?
Structure    Description         Use Case
Vector       1D homogeneous      Store numeric/character data
Matrix       2D homogeneous      Mathematical operations
Array        n-D homogeneous     Multidimensional data
List         1D heterogeneous    Complex data (mix of types)
Data Frame   2D heterogeneous    Tabular data (like Excel)
Factor       Categorical data    Used in statistical models

5. What is the difference between a vector and a list?


Answer:

●​ A vector holds elements of the same type (numeric, character, etc.)​

●​ A list can contain different types (e.g., numeric, character, vectors, even data frames)​

6. What are conditionals and control flows in R?


Answer:​
They are used to control execution based on conditions:

●​ if, else if, else → decision making​

●​ switch → for selecting among alternatives​

●​ Loops: for, while, repeat → for iteration​
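The earlier examples show only if/else and for, so here is a short sketch of the remaining constructs. The function names grade and describe_op and the cutoffs are hypothetical, chosen only for illustration:

```r
# if / else if / else: pick one of several branches.
grade <- function(score) {
  if (score >= 75) {
    "distinction"
  } else if (score >= 40) {
    "pass"
  } else {
    "fail"
  }
}

# switch: select by name; the final unnamed argument acts as the default.
describe_op <- function(op) {
  switch(op,
         mean   = "arithmetic average",
         median = "middle value",
         "unknown operation")
}

# while: repeat as long as the condition holds.
total <- 0
i <- 1
while (i <= 5) {
  total <- total + i
  i <- i + 1
}
total   # 15

# repeat: loop forever until break is reached.
n <- 1
repeat {
  n <- n * 2
  if (n > 100) break
}
n       # 128
```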


7. What is the Apply family in R and why is it useful?
Answer:​
The apply() family provides concise alternatives to explicit loops:

●​ apply() – applies function to rows or columns of a matrix​

●​ lapply() – applies function to each element of a list​

●​ sapply() – simplified version of lapply, returns vector or matrix​

●​ tapply() – applies function over subsets of a vector​

● Advantage: More concise and often more readable than explicit loops (not necessarily faster)

UNIT 4: Descriptive Statistics Using R


1. What is descriptive statistics?
Answer:​
Descriptive statistics summarizes the basic features of data through numerical measures and
visualizations. It includes:

●​ Measures of central tendency (mean, median, mode)​

●​ Measures of dispersion (range, variance, standard deviation)​

●​ Shape of distribution (skewness, kurtosis)​

2. What is the difference between mean, median, and mode?


●​ Mean: Average​

●​ Median: Middle value​

●​ Mode: Most frequent value​


Use:​

●​ Use median for skewed data or outliers​


●​ Use mean for symmetric, normal data​

3. What are measures of dispersion and why are they important?


Answer:​
They measure variability or spread in data.

●​ Range: Difference between max and min​

●​ Variance: Average of squared differences from the mean​

●​ Standard Deviation: Square root of variance​


These help understand data reliability and consistency.​

4. Explain covariance and correlation.


Answer:

●​ Covariance: Measures direction of linear relationship​

○​ Positive → move together​

○​ Negative → move opposite​

●​ Correlation: Measures strength and direction, scaled from -1 to +1​

○​ Pearson’s correlation is most common​

○​ Correlation = Covariance / (SD₁ × SD₂)​

5. What is coefficient of determination (R²)?


Answer:​
It indicates how much of the variance in the dependent variable is explained by the
independent variable(s).

●​ R² ranges from 0 to 1​

●​ A higher R² indicates a better model fit​


6. What are common charts used for visualization and what do they show?
Chart          Description
Histogram      Distribution of numeric data
Bar Chart      Frequency of categorical data
Box Plot       Distribution with median, IQR, outliers
Line Graph     Trend over time
Scatter Plot   Relationship between two variables

UNIT 5: Predictive Analytics (Excluding Textual Analysis)

1. What is simple linear regression?
Answer:​
A statistical method to model a linear relationship between one independent and one
dependent variable.​
Equation:​
Y = β₀ + β₁X + ε

●​ β₀: intercept​

●​ β₁: slope​

●​ ε: error term​
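For a single predictor, the least-squares estimates have closed forms: β₁ = cov(X, Y) / var(X) and β₀ = mean(Y) − β₁ · mean(X). The sketch below checks them against lm() on the built-in mtcars data (using wt and mpg is an illustrative assumption):

```r
x <- mtcars$wt
y <- mtcars$mpg

b1 <- cov(x, y) / var(x)        # slope
b0 <- mean(y) - b1 * mean(x)    # intercept

fit <- lm(mpg ~ wt, data = mtcars)
all.equal(unname(coef(fit)), c(b0, b1))   # TRUE
```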

2. What is multiple linear regression?


Answer:​
It models the relationship between a dependent variable and two or more independent
variables.​
Y = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ + ε
3. How do you interpret regression coefficients?
Answer:​
Each β coefficient represents the expected change in Y for a one-unit increase in the
respective X, holding all other Xs constant.

4. What are confidence intervals and prediction intervals?


● Confidence Interval: Range for the mean value of the dependent variable at a specific X

●​ Prediction Interval: Range for an individual predicted value​


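At the same confidence level and the same X, the prediction interval is always wider, because it also accounts for the error term ε of an individual observation.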
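The width difference can be seen directly. A sketch on mtcars, where the model and the new value wt = 3 are illustrative assumptions:

```r
model <- lm(mpg ~ wt, data = mtcars)
new_car <- data.frame(wt = 3)

conf_int <- predict(model, newdata = new_car, interval = "confidence", level = 0.95)
pred_int <- predict(model, newdata = new_car, interval = "prediction", level = 0.95)

# Both intervals are centred on the same fitted value.
all.equal(conf_int[, "fit"], pred_int[, "fit"])   # TRUE

# The prediction interval is wider at the same level and the same X.
conf_width <- conf_int[, "upr"] - conf_int[, "lwr"]
pred_width <- pred_int[, "upr"] - pred_int[, "lwr"]
pred_width > conf_width                           # TRUE
```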

5. What is heteroscedasticity? How does it affect regression?


Answer:

● When residuals have non-constant variance, e.g., their spread increases with X

● Violates an OLS assumption → coefficient estimates stay unbiased but become inefficient, and their standard errors are biased

●​ Detected using Breusch-Pagan test or residual plots​

6. What is multicollinearity and why is it a problem?


Answer:​
When independent variables are highly correlated with each other:

● It becomes difficult to assess their individual effects

● Standard errors of the coefficients are inflated

● It can be detected using VIF (Variance Inflation Factor)

7. What is the role of R² and Adjusted R² in regression?


●​ R²: Tells how much variance in Y is explained by Xs​
●​ Adjusted R²: Penalizes for adding irrelevant predictors​

●​ Use Adjusted R² for comparing models with different numbers of predictors​

8. What are some common pitfalls in regression modeling?


● Overfitting: Too many predictors, so the model fits the noise

● Underfitting: Too few predictors, so the model misses the trend

● Ignoring assumptions: leads to incorrect inference

● Not checking residuals: assumption violations go unnoticed


PACKAGES:

Unit 3: Getting Started with R


Focus: Data structures, import, control flows, basic programming

Purpose                             Package          Why It’s Needed
Importing Excel/CSV files           readr, readxl    For reading .csv and .xlsx files
Data wrangling                      dplyr, tibble    For data manipulation and better data frame handling
Viewing data types and structure    str, summary     Built-in base functions (no need to install)

✅ Installation code:
r
install.packages(c("readr", "readxl", "dplyr", "tibble"))

✅ Unit 4: Descriptive Statistics Using R


Focus: Data description, central tendency, dispersion, visualizations

Purpose                       Package         Why It’s Needed
Data visualization            ggplot2         For histograms, bar charts, box plots, etc.
Statistical summaries         psych           Provides functions like describe()
Correlation and covariance    stats (base)    Contains cor(), cov()
Basic boxplots and plots      Base R          Functions like hist(), boxplot(), plot()

✅ Installation code:
r
install.packages(c("ggplot2", "psych"))

✅ Unit 5: Predictive Analytics (Regression)


Focus: Linear and multiple regression, diagnostics

Purpose                                Package            Why It’s Needed
Simple & multiple linear regression    stats (base R)     lm() function is built-in
Regression diagnostics                 car                For VIF (variance inflation factor) and multicollinearity
Plotting regression and diagnostics    ggplot2, GGally    For enhanced regression visuals
Summary statistics & tests             lmtest             For Breusch-Pagan test (heteroscedasticity)

✅ Installation code:
r
install.packages(c("car", "GGally", "lmtest"))

📦 Full Master Install Command


You can install everything at once using this combined command:

r
install.packages(c("readr", "readxl", "dplyr", "tibble", "ggplot2",
"psych", "car", "GGally", "lmtest"))

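Re-running install.packages() reinstalls packages that are already present. One possible refinement, sketched below, is to check first and install only what is missing; find_missing is a hypothetical helper name, not a built-in function:

```r
pkgs <- c("readr", "readxl", "dplyr", "tibble", "ggplot2",
          "psych", "car", "GGally", "lmtest")

# Hypothetical helper: return the requested packages that are not yet installed.
find_missing <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

missing_pkgs <- find_missing(pkgs)
missing_pkgs   # character(0) if everything is already installed

# To install only the missing ones, uncomment the next line:
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```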